Pandas Tutorial
Introduction to Pandas
Pandas provides two convenient data structures for storing and manipulating data: Series and DataFrame. A Series is similar to a one-dimensional array, whereas a DataFrame is closer to a matrix or a spreadsheet table.
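As a rough illustration of the one-dimensional versus two-dimensional distinction, the sketch below builds one object of each kind and inspects its dimensions. The names `temps` and `table` and their values are made up for this example.

```python
import numpy as np
import pandas as pd

# A Series wraps one-dimensional data plus an index (hypothetical values).
temps = pd.Series([45.1, 42.4, 47.2], index=[2011, 2012, 2013])

# A DataFrame wraps two-dimensional data plus row and column labels.
table = pd.DataFrame(np.zeros((3, 2)), columns=['a', 'b'])

print(temps.ndim, temps.shape)  # 1 (3,)
print(table.ndim, table.shape)  # 2 (3, 2)
```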
In [1]:
import numpy as np
from pandas import Series
from pandas import DataFrame
import pandas as pd
%matplotlib inline
Series
- A Series object consists of a one-dimensional array of values, whose elements can be referenced using an index array.
- A Series object can be created from a list, a numpy array, or a Python dictionary.
- You can apply most numpy functions to a Series object.
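The three points above can be sketched in a few lines. This is a minimal example with made-up values, and the variable names (`from_list`, `from_array`, `from_dict`, `roots`) are hypothetical. It shows that a numpy function applied to a Series returns a new Series that keeps the original index.

```python
import numpy as np
import pandas as pd

# Three ways to build a Series (values chosen only for illustration).
from_list = pd.Series([1.0, 4.0, 9.0])                                # default 0..n-1 index
from_array = pd.Series(np.array([1.0, 4.0, 9.0]), index=['a', 'b', 'c'])
from_dict = pd.Series({'a': 1.0, 'b': 4.0, 'c': 9.0})                 # keys become the index

# A numpy function applied to a Series returns a Series with the same index.
roots = np.sqrt(from_array)

print(roots['c'])            # 3.0
print(list(roots.index))     # ['a', 'b', 'c']
```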
In [2]:
s = Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5]) # creating a series from a list
print(s)
print('Values=', s.values) # display values of the Series
print('Index=', s.index) # display indices of the Series
0    3.1
1    2.4
2   -1.7
3    0.2
4   -2.9
5    4.5
dtype: float64
Values= [ 3.1  2.4 -1.7  0.2 -2.9  4.5]
Index= RangeIndex(start=0, stop=6, step=1)
In [3]:
print(s.mean())
0.9333333333333332
In [4]:
s2 = Series(np.random.randn(6)) # creating a series from a numpy ndarray
print(s2)
print('Values=', s2.values) # display values of the Series
print('Index=', s2.index) # display indices of the Series
0    1.742451
1    1.962877
2   -0.061042
3    2.319547
4   -0.874027
5    0.669371
dtype: float64
Values= [ 1.74245094  1.96287667 -0.0610422   2.31954724 -0.87402656  0.66937117]
Index= RangeIndex(start=0, stop=6, step=1)
In [5]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3)
print('Values=', s3.values) # display values of the Series
print('Index=', s3.index) # display indices of the Series
Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64
Values= [ 1.2  2.5 -2.2  3.1 -0.8 -3.2]
Index= Index(['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'], dtype='object')
In [6]:
capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}
s4 = Series(capitals) # creating a series from dictionary object
print(s4)
print('Values=', s4.values) # display values of the Series
print('Index=', s4.index) # display indices of the Series
MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
dtype: object
Values= ['Lansing' 'Sacramento' 'Austin' 'St Paul']
Index= Index(['MI', 'CA', 'TX', 'MN'], dtype='object')
In [7]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3) # Accessing elements of a Series
print('\ns3.iloc[2]=', s3.iloc[2]) # third element by position; plain s3[2] is deprecated when the index holds labels
print('s3[\'Jan 3\']=', s3['Jan 3']) # indexing an element by its label
print('\ns3[1:3]=') # display a slice of the Series
print(s3[1:3])
print('s3.iloc[1:3]=') # the same slice, selected explicitly by position
print(s3.iloc[1:3])
Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64

s3.iloc[2]= -2.2
s3['Jan 3']= -2.2

s3[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
s3.iloc[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
In [8]:
print('shape =', s3.shape) # get the dimension of the Series
print('size =', s3.size) # get the # of elements of the Series
shape = (6,)
size = 6
In [9]:
print(s3[s3 > 0]) # applying filter to select elements of the Series
Jan 1    1.2
Jan 2    2.5
Jan 4    3.1
dtype: float64
In [10]:
print(s3 + 4) # applying scalar operation on a numeric Series
print(s3 / 4)
Jan 1    5.2
Jan 2    6.5
Jan 3    1.8
Jan 4    7.1
Jan 5    3.2
Jan 6    0.8
dtype: float64
Jan 1    0.300
Jan 2    0.625
Jan 3   -0.550
Jan 4    0.775
Jan 5   -0.200
Jan 6   -0.800
dtype: float64
In [11]:
print(np.log(s3 + 4)) # applying numpy math functions to a numeric Series
Jan 1    1.648659
Jan 2    1.871802
Jan 3    0.587787
Jan 4    1.960095
Jan 5    1.163151
Jan 6   -0.223144
dtype: float64
DataFrame
- A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can be of a different type (numeric, string, boolean, etc.). Unlike a Series, a DataFrame has distinct row and column indices. There are many ways to create a DataFrame object (e.g., from a dictionary, a list of tuples, or even numpy's ndarrays).
- A Series is the data structure for a single column of a DataFrame: selecting one column returns a Series, and a DataFrame can be thought of as a collection of Series that share the same row index.
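To make the relationship between the two structures concrete, here is a small sketch under the assumption of a two-row table with made-up values. The names `df`, `col`, and `rebuilt` are hypothetical. It pulls one column out as a Series and then reassembles an equal DataFrame from the individual columns.

```python
import pandas as pd

# A tiny table (values chosen only for illustration).
df = pd.DataFrame({'make': ['Ford', 'Honda'], 'MSRP': [27595, 23570]})

# Selecting a single column yields a Series that shares the row index.
col = df['MSRP']
print(type(col).__name__)          # Series
print(col.name)                    # MSRP
print(col.index.equals(df.index))  # True

# A DataFrame can also be assembled from Series objects.
rebuilt = pd.DataFrame({'make': df['make'], 'MSRP': df['MSRP']})
print(rebuilt.equals(df))          # True
```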
In [12]:
cars = {
'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
'MSRP': [27595, 23570, 23495, 68000]
}
carData = DataFrame(cars) # creating DataFrame from dictionary
carData # display the table
Out[12]:
| | make | model | MSRP |
---|---|---|---|
0 | Ford | Taurus | 27595 |
1 | Honda | Accord | 23570 |
2 | Toyota | Camry | 23495 |
3 | Tesla | Model S | 68000 |
In [13]:
print(carData.index) # print the row indices
print(carData.columns) # print the column indices
RangeIndex(start=0, stop=4, step=1)
Index(['make', 'model', 'MSRP'], dtype='object')
In [14]:
carData2 = DataFrame(cars, index = [1,2,3,4]) # change the row index
#print(carData2)
carData2['year'] = 2018 # add column with same value
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A']
carData2 # display table carData2
Out[14]:
| | make | model | MSRP | year | dealership |
---|---|---|---|---|---|
1 | Ford | Taurus | 27595 | 2018 | Courtesy Ford |
2 | Honda | Accord | 23570 | 2018 | Capital Honda |
3 | Toyota | Camry | 23495 | 2018 | Spartan Toyota |
4 | Tesla | Model S | 68000 | 2018 | N/A |
Creating a DataFrame from a list of tuples.
In [15]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
(2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData
Out[15]:
| | year | temp | precip |
---|---|---|---|
0 | 2011 | 45.1 | 32.4 |
1 | 2012 | 42.4 | 34.5 |
2 | 2013 | 47.2 | 39.2 |
3 | 2014 | 44.2 | 31.4 |
4 | 2015 | 39.9 | 29.8 |
5 | 2016 | 41.5 | 36.7 |
In [16]:
print(weatherData['temp'].max())
print(weatherData['temp'].mean())
print(weatherData['temp'].std())
47.2
43.383333333333326
2.639254945371264
In [17]:
npdata = np.random.randn(5,3) # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
data = DataFrame(npdata, columns=columnNames)
data
Out[17]:
| | x1 | x2 | x3 |
---|---|---|---|
0 | -2.521277 | 0.144287 | -0.241128 |
1 | 0.002553 | 0.438148 | -0.602433 |
2 | -1.600139 | -0.119694 | -0.593759 |
3 | -2.014750 | -1.636102 | 0.577415 |
4 | -1.525359 | 0.795262 | 1.904494 |
The elements of a DataFrame can be accessed in many ways.
In [18]:
# accessing an entire column will return a Series object
print(data['x2'])
print(type(data['x2']))
0    0.144287
1    0.438148
2   -0.119694
3   -1.636102
4    0.795262
Name: x2, dtype: float64
<class 'pandas.core.series.Series'>
In [19]:
# accessing an entire row will return a Series object
print('Row 3 of data table:')
print(data.iloc[2]) # returns the 3rd row of DataFrame
print(type(data.iloc[2]))
print('\nRow 3 of car data table:')
print(carData2.iloc[2]) # row contains objects of different types
Row 3 of data table:
x1   -1.600139
x2   -0.119694
x3   -0.593759
Name: 2, dtype: float64
<class 'pandas.core.series.Series'>

Row 3 of car data table:
make                  Toyota
model                  Camry
MSRP                   23495
year                    2018
dealership    Spartan Toyota
Name: 3, dtype: object
In [20]:
# accessing a specific element of the DataFrame
print(carData2.iloc[1,2]) # retrieving second row, third column
print(carData2.loc[1,'model']) # retrieving the row labeled 1 (the first row here), column named 'model'
# accessing a slice of the DataFrame
print('carData2.iloc[1:3,1:3]=')
print(carData2.iloc[1:3,1:3])
23570
Taurus
carData2.iloc[1:3,1:3]=
    model   MSRP
2  Accord  23570
3   Camry  23495
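The cell above mixes `iloc` and `loc` on a table whose row labels start at 1, so the same number can point at different rows. The sketch below isolates that difference on a smaller, hypothetical table named `cars_demo`. It is meant as an illustration of label-based versus position-based access, not a complete guide to indexing.

```python
import pandas as pd

# Row labels start at 1, while positions always start at 0.
cars_demo = pd.DataFrame(
    {'make': ['Ford', 'Honda', 'Toyota'], 'MSRP': [27595, 23570, 23495]},
    index=[1, 2, 3],
)

# loc looks up labels: label 1 is the first row.
print(cars_demo.loc[1, 'make'])   # Ford

# iloc looks up positions: position 1 is the second row.
print(cars_demo.iloc[1, 0])       # Honda

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(cars_demo.loc[1:2]))    # 2
print(len(cars_demo.iloc[1:2]))   # 1
```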
In [21]:
print('carData2.shape =', carData2.shape)
print('carData2.size =', carData2.size)
carData2.shape = (4, 5)
carData2.size = 20
In [22]:
# selection and filtering
print('carData2[carData2.MSRP > 25000]')
print(carData2[carData2.MSRP > 25000])
carData2[carData2.MSRP > 25000]
    make    model   MSRP  year     dealership
1   Ford   Taurus  27595  2018  Courtesy Ford
4  Tesla  Model S  68000  2018            N/A
Arithmetic Operations
In [23]:
print(data)
print('Data transpose operation:')
print(data.T) # transpose operation
print('Addition:')
print(data + 4) # addition operation
print('Multiplication:')
print(data * 10) # multiplication operation
         x1        x2        x3
0 -2.521277  0.144287 -0.241128
1  0.002553  0.438148 -0.602433
2 -1.600139 -0.119694 -0.593759
3 -2.014750 -1.636102  0.577415
4 -1.525359  0.795262  1.904494
Data transpose operation:
           0         1         2         3         4
x1 -2.521277  0.002553 -1.600139 -2.014750 -1.525359
x2  0.144287  0.438148 -0.119694 -1.636102  0.795262
x3 -0.241128 -0.602433 -0.593759  0.577415  1.904494
Addition:
         x1        x2        x3
0  1.478723  4.144287  3.758872
1  4.002553  4.438148  3.397567
2  2.399861  3.880306  3.406241
3  1.985250  2.363898  4.577415
4  2.474641  4.795262  5.904494
Multiplication:
          x1         x2         x3
0 -25.212771   1.442875  -2.411278
1   0.025525   4.381484  -6.024328
2 -16.001395  -1.196938  -5.937587
3 -20.147504 -16.361023   5.774146
4 -15.253592   7.952623  19.044936
In [24]:
print('data =')
print(data)
columnNames = ['x1','x2','x3']
data2 = DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)
print('\ndata + data2 = ')
print(data.add(data2))
print('\ndata * data2 = ')
print(data.mul(data2))
data =
         x1        x2        x3
0 -2.521277  0.144287 -0.241128
1  0.002553  0.438148 -0.602433
2 -1.600139 -0.119694 -0.593759
3 -2.014750 -1.636102  0.577415
4 -1.525359  0.795262  1.904494

data2 =
         x1        x2        x3
0 -0.655159  0.870909 -0.009572
1 -0.467867  0.817790  1.626673
2 -0.503922  0.252201 -0.156795
3 -0.800549  2.161829 -1.774008
4  0.136732  1.356657  0.689143

data + data2 = 
         x1        x2        x3
0 -3.176436  1.015196 -0.250700
1 -0.465314  1.255938  1.024240
2 -2.104062  0.132507 -0.750553
3 -2.815299  0.525727 -1.196593
4 -1.388627  2.151919  2.593637

data * data2 = 
         x1        x2        x3
0  1.651838  0.125661  0.002308
1 -0.001194  0.358313 -0.979961
2  0.806346 -0.030187  0.093098
3  1.612907 -3.536974 -1.024338
4 -0.208566  1.078898  1.312469
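In the cell above, `data` and `data2` share the same row and column labels, so every element finds a partner. When the labels only partly overlap, `add` and `mul` match values by label rather than by position. The sketch below uses two small, made-up frames (`a` and `b`) with different row labels to show that behavior, along with the `fill_value` parameter of `DataFrame.add`.

```python
import pandas as pd

# Two frames whose row labels only partly overlap (illustrative values).
a = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0]}, index=[0, 1])
b = pd.DataFrame({'x1': [10.0, 20.0], 'x2': [30.0, 40.0]}, index=[1, 2])

# Values are matched by label; labels present in only one frame give NaN.
total = a.add(b)
print(total.loc[1, 'x1'])          # 12.0 (2.0 + 10.0)
print(total.isna().sum().sum())    # 4 (rows 0 and 2 are entirely NaN)

# fill_value substitutes a default for the side that is missing.
filled = a.add(b, fill_value=0)
print(filled.loc[0, 'x1'])         # 1.0
print(filled.loc[2, 'x2'])         # 40.0
```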
In [25]:
print(data.abs()) # get the absolute value for each element
print('\nMaximum value per column:')
print(data.max()) # get maximum value for each column
print('\nMinimum value per row:')
print(data.min(axis=1)) # get minimum value for each row
print('\nSum of values per column:')
print(data.sum()) # get sum of values for each column
print('\nAverage value per row:')
print(data.mean(axis=1)) # get average value for each row
print('\nCalculate max - min per column')
f = lambda x: x.max() - x.min()
print(data.apply(f))
print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))
         x1        x2        x3
0  2.521277  0.144287  0.241128
1  0.002553  0.438148  0.602433
2  1.600139  0.119694  0.593759
3  2.014750  1.636102  0.577415
4  1.525359  0.795262  1.904494

Maximum value per column:
x1    0.002553
x2    0.795262
x3    1.904494
dtype: float64

Minimum value per row:
0   -2.521277
1   -0.602433
2   -1.600139
3   -2.014750
4   -1.525359
dtype: float64

Sum of values per column:
x1   -7.658974
x2   -0.378098
x3    1.044589
dtype: float64

Average value per row:
0   -0.872706
1   -0.053911
2   -0.771197
3   -1.024479
4    0.391466
dtype: float64

Calculate max - min per column
x1    2.523830
x2    2.431365
x3    2.506926
dtype: float64

Calculate max - min per row
0    2.665565
1    1.040581
2    1.480446
3    2.592165
4    3.429853
dtype: float64
Plotting Series and DataFrame
There are built-in functions you can use to plot the data stored in a Series or a DataFrame.
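Outside a notebook, the same `plot` method can be used in a plain script. The sketch below is one possible setup, assuming matplotlib is installed. It selects the non-interactive `Agg` backend so that no display is needed and renders the figure into an in-memory buffer. The names `s_demo` and `buffer` and the data values are hypothetical.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# A short Series with made-up values.
s_demo = pd.Series([1.2, 2.5, -2.2], index=['Jan 1', 'Jan 2', 'Jan 3'])

# plot() returns a matplotlib Axes object that can be customized further.
ax = s_demo.plot(kind='bar', title='Bar plot')
ax.set_ylabel('value')

# Render the figure as PNG bytes in memory instead of showing it inline.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format='png')
plt.close(ax.figure)
```

The returned Axes object is the usual place to adjust labels, limits, or legends before saving.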
In [26]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2,1.4], index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6','Jan 7'])
s3.plot(kind='line', title='Line plot')
Out[26]:
<AxesSubplot:title={'center':'Line plot'}>
In [27]:
s3.plot(kind='bar', title='Bar plot')
Out[27]:
<AxesSubplot:title={'center':'Bar plot'}>
In [28]:
s3.plot(kind='hist', title = 'Histogram')
Out[28]:
<AxesSubplot:title={'center':'Histogram'}, ylabel='Frequency'>
In [29]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
(2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData[['temp','precip']].plot(kind='box', title='Box plot')
Out[29]:
<AxesSubplot:title={'center':'Box plot'}>