Introduction to Pandas¶

Pandas provides two convenient data structures for storing and manipulating data: Series and DataFrame. A Series is similar to a one-dimensional array, whereas a DataFrame resembles a matrix or a spreadsheet table.

In [1]:

import numpy as np
from pandas import Series
from pandas import DataFrame
import pandas as pd
%matplotlib inline

Series¶

A Series object consists of a one-dimensional array of values whose elements can be referenced through an index array.
A Series object can be created from a list, a NumPy array, or a Python dictionary.
Most NumPy functions can be applied directly to a Series object.

In [2]:

s = Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5])  # creating a series from a list
print(s)
print('Values=', s.values)  # display values of the Series
print('Index=', s.index)  # display indices of the Series

0    3.1
1    2.4
2   -1.7
3    0.2
4   -2.9
5    4.5
dtype: float64
Values= [ 3.1  2.4 -1.7  0.2 -2.9  4.5]
Index= RangeIndex(start=0, stop=6, step=1)

In [3]:

print(s.mean())

0.9333333333333332

In [4]:

s2 = Series(np.random.randn(6)) # creating a series from a numpy ndarray
print(s2)
print('Values=', s2.values)  # display values of the Series
print('Index=', s2.index)  # display indices of the Series

0    1.742451
1    1.962877
2   -0.061042
3    2.319547
4   -0.874027
5    0.669371
dtype: float64
Values= [ 1.74245094  1.96287667 -0.0610422   2.31954724 -0.87402656  0.66937117]
Index= RangeIndex(start=0, stop=6, step=1)

In [5]:

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3)
print('Values=', s3.values)  # display values of the Series
print('Index=', s3.index)  # display indices of the Series

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64
Values= [ 1.2  2.5 -2.2  3.1 -0.8 -3.2]
Index= Index(['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'], dtype='object')

In [6]:

capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}
s4 = Series(capitals)  # creating a series from dictionary object
print(s4)
print('Values=', s4.values)  # display values of the Series
print('Index=', s4.index)  # display indices of the Series

MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
dtype: object
Values= ['Lansing' 'Sacramento' 'Austin' 'St Paul']
Index= Index(['MI', 'CA', 'TX', 'MN'], dtype='object')
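
When a dictionary is passed together with an explicit index, pandas looks up each requested label in the dictionary, and labels without a matching key receive a missing value (NaN). The sketch below reuses the capitals dictionary from the cell above; the 'NY' label and the name s5 are hypothetical additions, chosen only because 'NY' is absent from the dictionary.

```python
import pandas as pd

capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}

# Only the requested labels are kept, in the requested order.
# 'NY' (hypothetical label) is not a key of the dictionary, so its value is missing.
s5 = pd.Series(capitals, index=['TX', 'MI', 'NY'])
print(s5)
print(s5.isna())  # True only for the 'NY' entry
```

This is one reason to inspect a dictionary-built Series with isna() when the index comes from a different source than the keys.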

In [7]:

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3)  # Accessing elements of a Series
print('\ns3.iloc[2]=', s3.iloc[2])  # display third element of the Series (by position)
print('s3[\'Jan 3\']=', s3['Jan 3'])  # indexing element of a Series by its label
print('\ns3[1:3]=')  # display a slice of the Series
print(s3[1:3])
print('s3.iloc[1:3]=')  # display the same slice by explicit position
print(s3.iloc[1:3])

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64

s3.iloc[2]= -2.2
s3['Jan 3']= -2.2

s3[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
s3.iloc[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
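
Position-based and label-based slices treat their endpoints differently, which is easy to miss in the cell above. A minimal sketch, assuming the same six-element Series (the names s, by_position, and by_label are illustrative only):

```python
import pandas as pd

s = pd.Series([1.2, 2.5, -2.2, 3.1, -0.8, -3.2],
              index=['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'])

# .iloc slices by position, like a Python list: the stop position is excluded.
by_position = s.iloc[1:3]

# .loc slices by label: both the start and the end label are included.
by_label = s.loc['Jan 2':'Jan 4']

print(by_position)
print(by_label)
```

Using .iloc for positions and .loc for labels keeps the intent explicit, which matters most when the index itself contains integers.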

In [8]:

print('shape =', s3.shape)  # get the dimension of the Series
print('size =', s3.size)  # get the # of elements of the Series

shape = (6,)
size = 6

In [9]:

print(s3[s3 > 0])  # applying filter to select elements of the Series

Jan 1    1.2
Jan 2    2.5
Jan 4    3.1
dtype: float64
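
The filter above works because the comparison itself produces a boolean Series, which is then used as a mask. Several conditions can be combined with & (and) and | (or), provided each condition is wrapped in parentheses. A small sketch with illustrative variable names:

```python
import pandas as pd

s = pd.Series([1.2, 2.5, -2.2, 3.1, -0.8, -3.2],
              index=['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'])

mask = s > 0          # boolean Series with the same index as s
print(mask)

# Parentheses are required because & and | bind tighter than comparisons.
between = s[(s > 0) & (s < 3)]      # positive values below 3
extremes = s[(s < -2) | (s > 3)]    # values below -2 or above 3

print(between)
print(extremes)
```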

In [10]:

print(s3 + 4)  # applying scalar operation on a numeric Series
print(s3 / 4)

Jan 1    5.2
Jan 2    6.5
Jan 3    1.8
Jan 4    7.1
Jan 5    3.2
Jan 6    0.8
dtype: float64
Jan 1    0.300
Jan 2    0.625
Jan 3   -0.550
Jan 4    0.775
Jan 5   -0.200
Jan 6   -0.800
dtype: float64

In [11]:

print(np.log(s3 + 4))  # applying numpy math functions to a numeric Series

Jan 1    1.648659
Jan 2    1.871802
Jan 3    0.587787
Jan 4    1.960095
Jan 5    1.163151
Jan 6   -0.223144
dtype: float64

DataFrame¶

A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can be of a different type (numeric, string, boolean, etc.). Unlike a Series, a DataFrame has distinct row and column indices. There are many ways to create a DataFrame object (e.g., from a dictionary, a list of tuples, or even a NumPy ndarray).
A Series is the data structure for a single column of a DataFrame: selecting one column returns a Series object that shares the DataFrame's row index.
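
The relationship between columns, types, and Series can be checked directly. The sketch below is only an illustration: the small table and its 'electric' column are hypothetical and not part of the car data used later.

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['Ford', 'Honda'],
    'MSRP': [27595, 23570],
    'electric': [False, False],   # hypothetical boolean column
})

print(df.dtypes)                   # one dtype per column

col = df['MSRP']                   # selecting a single column
print(type(col))                   # a pandas Series
print(col.index.equals(df.index))  # the column carries the table's row index
```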

In [12]:

cars = {
    'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
    'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
    'MSRP': [27595, 23570, 23495, 68000]
}
carData = DataFrame(cars)  # creating DataFrame from dictionary
carData  # display the table

Out[12]:

	make	model	MSRP
0	Ford	Taurus	27595
1	Honda	Accord	23570
2	Toyota	Camry	23495
3	Tesla	Model S	68000

In [13]:

print(carData.index)  # print the row indices
print(carData.columns)  # print the column indices

RangeIndex(start=0, stop=4, step=1)
Index(['make', 'model', 'MSRP'], dtype='object')

In [14]:

carData2 = DataFrame(cars, index = [1,2,3,4]) # change the row index
#print(carData2)
carData2['year'] = 2018  # add column with same value
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A']
carData2  # display table  carData2

Out[14]:

	make	model	MSRP	year	dealership
1	Ford	Taurus	27595	2018	Courtesy Ford
2	Honda	Accord	23570	2018	Capital Honda
3	Toyota	Camry	23495	2018	Spartan Toyota
4	Tesla	Model S	68000	2018	N/A

Creating a DataFrame from a list of tuples.

In [15]:

tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
             (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData

Out[15]:

	year	temp	precip
0	2011	45.1	32.4
1	2012	42.4	34.5
2	2013	47.2	39.2
3	2014	44.2	31.4
4	2015	39.9	29.8
5	2016	41.5	36.7

In [16]:

print(weatherData['temp'].max())
print(weatherData['temp'].mean())
print(weatherData['temp'].std())

47.2
43.383333333333326
2.639254945371264
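
The standard deviation printed above is the sample standard deviation: pandas divides by N - 1 by default (ddof=1), whereas NumPy's np.std divides by N (ddof=0). A short sketch with the same temperature values shows the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

temp = pd.Series([45.1, 42.4, 47.2, 44.2, 39.9, 41.5])

sample_std = temp.std()                   # pandas default, ddof=1
population_std = temp.std(ddof=0)         # divide by N instead of N - 1
numpy_default = np.std(temp.to_numpy())   # NumPy default, ddof=0

print(sample_std, population_std, numpy_default)
```

The two conventions agree only when ddof is set explicitly, so it is worth stating which one is meant when comparing results across libraries.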

In [17]:

npdata = np.random.randn(5,3) # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
data = DataFrame(npdata, columns=columnNames)
data

Out[17]:

	x1	x2	x3
0	-2.521277	0.144287	-0.241128
1	0.002553	0.438148	-0.602433
2	-1.600139	-0.119694	-0.593759
3	-2.014750	-1.636102	0.577415
4	-1.525359	0.795262	1.904494

The elements of a DataFrame can be accessed in many ways.

In [18]:

# accessing an entire column will return a Series object

print(data['x2'])
print(type(data['x2']))

0    0.144287
1    0.438148
2   -0.119694
3   -1.636102
4    0.795262
Name: x2, dtype: float64
<class 'pandas.core.series.Series'>

In [19]:

# accessing an entire row will return a Series object

print('Row 3 of data table:')
print(data.iloc[2])  # returns the 3rd row of DataFrame
print(type(data.iloc[2]))
print('\nRow 3 of car data table:')
print(carData2.iloc[2])  # row contains objects of different types

Row 3 of data table:
x1   -1.600139
x2   -0.119694
x3   -0.593759
Name: 2, dtype: float64
<class 'pandas.core.series.Series'>

Row 3 of car data table:
make                  Toyota
model                  Camry
MSRP                   23495
year                    2018
dealership    Spartan Toyota
Name: 3, dtype: object

In [20]:

# accessing a specific element of the DataFrame
print(carData2.iloc[1,2])  # retrieving second row, third column
print(carData2.loc[1,'model']) # retrieving the row labeled 1 (the first row), column named 'model'

# accessing a slice of the DataFrame
print('carData2.iloc[1:3,1:3]=')
print(carData2.iloc[1:3,1:3])

23570
Taurus
carData2.iloc[1:3,1:3]=
    model   MSRP
2  Accord  23570
3   Camry  23495
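
Because the row index of this table starts at 1, the label 1 and the position 1 refer to different rows. The sketch below rebuilds a reduced version of the car table (two columns only, under the illustrative name cars) to make the distinction explicit:

```python
import pandas as pd

cars = pd.DataFrame(
    {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
     'model': ['Taurus', 'Accord', 'Camry', 'Model S']},
    index=[1, 2, 3, 4],
)

first_by_label = cars.loc[1, 'model']    # row labeled 1 is the first row
second_by_position = cars.iloc[1, 1]     # position 1 is the second row
print(first_by_label, second_by_position)

# Label slices include the end label; position slices exclude the stop position.
label_slice = cars.loc[2:3, 'model']
position_slice = cars.iloc[1:3, 1]
print(label_slice.equals(position_slice))
```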

In [21]:

print('carData2.shape =', carData2.shape)
print('carData2.size =', carData2.size)

carData2.shape = (4, 5)
carData2.size = 20

In [22]:

# selection and filtering

print('carData2[carData2.MSRP > 25000]')
print(carData2[carData2.MSRP > 25000])

carData2[carData2.MSRP > 25000]
    make    model   MSRP  year     dealership
1   Ford   Taurus  27595  2018  Courtesy Ford
4  Tesla  Model S  68000  2018            N/A
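
Row filters can combine several conditions, and the same selection can also be written as a string expression with DataFrame.query. The sketch below uses a reduced, illustrative copy of the car table; the price thresholds are arbitrary.

```python
import pandas as pd

cars = pd.DataFrame({
    'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
    'MSRP': [27595, 23570, 23495, 68000],
})

# Boolean masks are combined with & and |; each condition needs parentheses.
mid_range = cars[(cars['MSRP'] > 23500) & (cars['MSRP'] < 30000)]

# The same filter expressed with query().
same_rows = cars.query('MSRP > 23500 and MSRP < 30000')

# .loc accepts a row mask and a column list together.
expensive_makes = cars.loc[cars['MSRP'] > 25000, ['make']]

print(mid_range)
print(expensive_makes)
```

Bracket access such as cars['MSRP'] also works for column names that contain spaces or clash with DataFrame attributes, where the dotted form would not.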

Arithmetic Operations¶

In [23]:

print(data)

print('Data transpose operation:')
print(data.T)  # transpose operation

print('Addition:')
print(data + 4)  # addition operation

print('Multiplication:')
print(data * 10)  # multiplication operation

         x1        x2        x3
0 -2.521277  0.144287 -0.241128
1  0.002553  0.438148 -0.602433
2 -1.600139 -0.119694 -0.593759
3 -2.014750 -1.636102  0.577415
4 -1.525359  0.795262  1.904494
Data transpose operation:
           0         1         2         3         4
x1 -2.521277  0.002553 -1.600139 -2.014750 -1.525359
x2  0.144287  0.438148 -0.119694 -1.636102  0.795262
x3 -0.241128 -0.602433 -0.593759  0.577415  1.904494
Addition:
         x1        x2        x3
0  1.478723  4.144287  3.758872
1  4.002553  4.438148  3.397567
2  2.399861  3.880306  3.406241
3  1.985250  2.363898  4.577415
4  2.474641  4.795262  5.904494
Multiplication:
          x1         x2         x3
0 -25.212771   1.442875  -2.411278
1   0.025525   4.381484  -6.024328
2 -16.001395  -1.196938  -5.937587
3 -20.147504 -16.361023   5.774146
4 -15.253592   7.952623  19.044936

In [24]:

print('data =')
print(data)

columnNames = ['x1','x2','x3']
data2 = DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)

print('\ndata + data2 = ')
print(data.add(data2))

print('\ndata * data2 = ')
print(data.mul(data2))

data =
         x1        x2        x3
0 -2.521277  0.144287 -0.241128
1  0.002553  0.438148 -0.602433
2 -1.600139 -0.119694 -0.593759
3 -2.014750 -1.636102  0.577415
4 -1.525359  0.795262  1.904494

data2 =
         x1        x2        x3
0 -0.655159  0.870909 -0.009572
1 -0.467867  0.817790  1.626673
2 -0.503922  0.252201 -0.156795
3 -0.800549  2.161829 -1.774008
4  0.136732  1.356657  0.689143

data + data2 = 
         x1        x2        x3
0 -3.176436  1.015196 -0.250700
1 -0.465314  1.255938  1.024240
2 -2.104062  0.132507 -0.750553
3 -2.815299  0.525727 -1.196593
4 -1.388627  2.151919  2.593637

data * data2 = 
         x1        x2        x3
0  1.651838  0.125661  0.002308
1 -0.001194  0.358313 -0.979961
2  0.806346 -0.030187  0.093098
3  1.612907 -3.536974 -1.024338
4 -0.208566  1.078898  1.312469
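
In the cell above both tables share the same row and column labels, so add and mul behave like elementwise + and *. When the labels differ, pandas aligns on them first, and any label present on only one side yields NaN. The method form accepts a fill_value to handle that case. A minimal sketch with two small hypothetical tables:

```python
import pandas as pd

a = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0]}, index=[0, 1])
b = pd.DataFrame({'x1': [10.0, 20.0], 'x2': [30.0, 40.0]}, index=[1, 2])

plain = a + b                     # rows 0 and 2 exist on one side only, so they become NaN
filled = a.add(b, fill_value=0)   # a missing side is treated as 0 before adding

print(plain)
print(filled)
```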

In [25]:

print(data.abs())  # get the absolute value for each element

print('\nMaximum value per column:')
print(data.max())  # get maximum value for each column

print('\nMinimum value per row:')
print(data.min(axis=1))  # get minimum value for each row

print('\nSum of values per column:')
print(data.sum())  # get sum of values for each column

print('\nAverage value per row:')
print(data.mean(axis=1))  # get average value for each row

print('\nCalculate max - min per column')
f = lambda x: x.max() - x.min()
print(data.apply(f))

print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))

         x1        x2        x3
0  2.521277  0.144287  0.241128
1  0.002553  0.438148  0.602433
2  1.600139  0.119694  0.593759
3  2.014750  1.636102  0.577415
4  1.525359  0.795262  1.904494

Maximum value per column:
x1    0.002553
x2    0.795262
x3    1.904494
dtype: float64

Minimum value per row:
0   -2.521277
1   -0.602433
2   -1.600139
3   -2.014750
4   -1.525359
dtype: float64

Sum of values per column:
x1   -7.658974
x2   -0.378098
x3    1.044589
dtype: float64

Average value per row:
0   -0.872706
1   -0.053911
2   -0.771197
3   -1.024479
4    0.391466
dtype: float64

Calculate max - min per column
x1    2.523830
x2    2.431365
x3    2.506926
dtype: float64

Calculate max - min per row
0    2.665565
1    1.040581
2    1.480446
3    2.592165
4    3.429853
dtype: float64
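
For simple reductions such as max minus min, the apply call can be replaced by arithmetic on the built-in reductions, which gives the same result. The sketch below uses freshly generated random data (seeded only so the example is repeatable), so its numbers will not match the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
data = pd.DataFrame(rng.standard_normal((5, 3)), columns=['x1', 'x2', 'x3'])

# Per column: apply with a lambda versus direct reductions.
col_range_apply = data.apply(lambda col: col.max() - col.min())
col_range_direct = data.max() - data.min()

# Per row: axis=1 makes each reduction run across the columns of a row.
row_range_apply = data.apply(lambda row: row.max() - row.min(), axis=1)
row_range_direct = data.max(axis=1) - data.min(axis=1)

print(col_range_direct)
print(row_range_direct)
```

The apply form remains useful when the per-column or per-row computation has no ready-made reduction.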

Plotting Series and DataFrame¶

There are built-in functions you can use to plot the data stored in a Series or a DataFrame.

In [26]:

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2,1.4], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6','Jan 7'])
s3.plot(kind='line', title='Line plot')

Out[26]:

<AxesSubplot:title={'center':'Line plot'}>

In [27]:

s3.plot(kind='bar', title='Bar plot')

Out[27]:

<AxesSubplot:title={'center':'Bar plot'}>

In [28]:

s3.plot(kind='hist', title = 'Histogram')

Out[28]:

<AxesSubplot:title={'center':'Histogram'}, ylabel='Frequency'>

In [29]:

tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
             (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData[['temp','precip']].plot(kind='box', title='Box plot')

Out[29]:

<AxesSubplot:title={'center':'Box plot'}>

Joining two dataframes is out of scope.¶

You will understand these operations very well after learning SQL.

