Have you ever wondered how powerful pandas is?
It was not until I wrote a Python data class for manipulating CSV files in my Data Visualization and Analysis course that I really marveled at the functionality of pandas. For those who are not familiar with it, pandas is an open-source library for the Python programming language that makes data analysis and manipulation easier. In this post, I highlight some features of my Python data class that could have been implemented far more easily with pandas. To see the code, check out my project. Hopefully, you’ll appreciate pandas a little bit more after reading this post.
I had to write a method that reads a .csv file and parses it to only store numeric columns of data in a 2D tabular format. Here’s the code in the read method:
import csv
import numpy as np

def read(self, filepath):
    self.filepath = filepath
    with open(filepath, "r") as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        # remove whitespace around each value
        data = [[x.strip() for x in row] for row in reader]
    # convert data to a NumPy array
    self.data = np.array(data)
    # pick out numeric columns only (the second row holds each column's data type)
    arr_indx = np.where(self.data[1, :] == "numeric")
    # change arr_indx from 2D to 1D
    arr_indx = np.array(arr_indx).flatten()
    # keep only the numeric headers
    self.headers = self.data[0, arr_indx].tolist()
    # create empty list for header indexes
    header_indx = []
    # append one index per numeric column to header_indx
    for i in range(len(arr_indx)):
        header_indx.append(i)
    # zip self.headers with header_indx to form the header2col dictionary
    d1 = zip(self.headers, header_indx)
    self.header2col = dict(d1)
    # keep only the data rows, dropping the header and data type rows
    self.data = self.data[2:, arr_indx]
    # convert self.data to type float
    self.data = self.data.astype("float64")
Getting the column names from a CSV file.
With pandas, I didn’t have to write a method that reads a .csv file because it has a built-in function called read_csv() that does exactly that. Let’s see the same implementation using pandas this time.
import pandas as pd
# load the iris CSV as a pandas DataFrame
data = pd.read_csv('filepath')
# print column names
print(data.columns)
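One wrinkle: the files my read method handled carry a second row of type labels such as numeric, and read_csv would treat that row as ordinary data. Here is a minimal sketch of one way to reproduce the same filtering in pandas, assuming that two-row header layout. The load_numeric helper, the sample text, and the column names are made up for illustration.

```python
import io

import pandas as pd


def load_numeric(source):
    """Read a CSV whose second row holds type labels; keep numeric columns."""
    # read everything as text; skipinitialspace drops spaces after each comma
    raw = pd.read_csv(source, skipinitialspace=True, dtype=str)
    # strip any remaining whitespace from the column names
    raw.columns = raw.columns.str.strip()
    # the first data row is the row of type labels
    types = raw.iloc[0].str.strip()
    numeric_cols = types[types == "numeric"].index
    # drop the type row, keep the numeric columns, convert to float
    return raw.loc[1:, numeric_cols].astype("float64").reset_index(drop=True)


# hypothetical sample in the two-row header layout
sample = io.StringIO(
    "sepal_length, sepal_width, species\n"
    "numeric, numeric, string\n"
    "5.1, 3.5, setosa\n"
    "4.9, 3.0, versicolor\n"
)
df = load_numeric(sample)
print(df.columns.tolist())
print(df)
```

For a CSV with a single header row, plain pd.read_csv followed by select_dtypes(include="number") would likely be enough.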
Getting the first 5 data samples from a CSV file.
In the Python data class, I had to convert the data to a NumPy array and use slicing to get the first 5 data samples. Although slicing is relatively easy, the pandas method for getting the first 5 rows is even easier, as seen below:
# slicing: first five data samples
self.data[0:5]
# pandas (data is the DataFrame loaded with read_csv)
data.head()
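To make the comparison concrete, here is a small sketch using a made-up ten-row DataFrame in place of the loaded CSV. It shows that head() defaults to five rows, accepts a different count, and matches a positional slice with iloc.

```python
import pandas as pd

# hypothetical ten-row frame standing in for the loaded CSV
data = pd.DataFrame({"x": range(10), "y": [v * 0.5 for v in range(10)]})

first_five = data.head()     # default: first 5 rows
first_three = data.head(3)   # pass n for a different count
sliced = data.iloc[0:5]      # positional slice, like the NumPy version

# both approaches return the same rows
print(first_five.equals(sliced))
```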
Selecting a subset of data corresponding to given column names.
With the Python data class, getting the subset of data for one or more column headers involved a for loop and an if-else statement, and it relied on the headers stored by the read method above:
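def select_data(self, headers, rows=[]):
    # look up the column index of each requested header
    col_indices = []
    for i in headers:
        col_indices.append(self.headers.index(i))
    # no rows given: return every row of the selected columns
    if rows == []:
        return self.data[:, col_indices]
    else:
        return self.data[np.ix_(rows, col_indices)]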
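The np.ix_ call in the last branch is easy to gloss over, so here is a small standalone sketch on a made-up array. It builds an index grid so that every requested row is combined with every requested column. Passing the two lists directly instead pairs them up element by element.

```python
import numpy as np

# made-up 4x3 array: rows are samples, columns are variables
data = np.arange(12, dtype="float64").reshape(4, 3)
rows = [0, 2]
col_indices = [0, 2]

# every combination of the chosen rows and columns: a 2x2 block
block = data[np.ix_(rows, col_indices)]

# without np.ix_, the lists are paired: elements (0, 0) and (2, 2) only
paired = data[rows, col_indices]

print(block)
print(paired)
```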
Pandas made the work very easy. To get a subset of data given the column name, simply pass in the column names that you are interested in as follows:
# select subset of data
data_1 = data[['variable name']]
print(data_1)
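My select_data method could also take a list of rows. A rough pandas counterpart, sketched here with a made-up DataFrame and made-up column names, is to pass both the rows and the columns to loc (by label) or iloc (by position).

```python
import pandas as pd

# made-up frame standing in for the loaded CSV
data = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [10.0, 20.0, 30.0, 40.0],
        "c": [0.1, 0.2, 0.3, 0.4],
    }
)

# columns only: every row of the chosen columns
cols_only = data[["a", "c"]]

# rows and columns together, selected by label
rows_and_cols = data.loc[[0, 2], ["a", "c"]]

# the same selection by integer position
by_position = data.iloc[[0, 2], [0, 2]]

print(rows_and_cols)
```

With the default integer index, labels and positions coincide, so loc and iloc return the same rows here. They differ once the index is something else, such as dates or names.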
Conclusion
Pandas is an amazing tool for preprocessing data. As an aspiring data scientist, I continue to marvel at its power. Since pandas is an open-source library, you can easily access it here if you want to learn more. Thank you for reading, and if you can, please leave some feedback.