
Data Cleaning code

The initial data table was investigated for redundant and missing data, and certain redundant or mostly empty columns were dropped outright.

The following is an abbreviated demonstration of the data investigation; the large number of columns would make a complete demonstration unwieldy. All adjustments shown to the cat DataFrame were applied to generate the final data table.


	import requests as re
	import pandas as pd
	import numpy as np
	from bs4 import BeautifulSoup
	from pathlib import Path  # for save paths

	cat = pd.DataFrame(re.get("https://api.thecatapi.com/v1/breeds").json())
	
	# Total number of cats
	cat.shape[0] 
	> 67 
	
	# Number of null entries per column
	cat.isna().sum().sort_values(ascending=False) 
	> bidability            65
	> cat_friendly          60
	> vcahospitals_url      25
	> cfa_url               24
	> lap                   20
	> vetstreet_url         17
	> alt_names              4
	> image                  2
	> reference_image_id     2
	> wikipedia_url          1 
	
	# Note that some entries are blank strings, which are not captured by "cat.isna()" or "cat.isnull()"
	# Combined with the 4 null entries above, there are actually 24 cats without "alt_names" out of 67 total.
	cat[cat.alt_names == ''].shape[0] 
	> 20 
	
	# pandas has an "na_filter" parameter when reading in data that can be adjusted

	# country_codes is the same as country_code
	cat[cat.country_codes != cat.country_code].shape[0] 
	> 0	 
    

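The difference between null entries and blank strings matters whenever the data is re-read. A minimal sketch (using a toy DataFrame and an in-memory CSV, not the real API table) shows both checks and the effect of the na_filter parameter:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Toy frame standing in for the API data: alt_names mixes NaN and empty strings
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'alt_names': ['Moggy', '', np.nan, '']})

n_null = df['alt_names'].isna().sum()      # only NaN is counted -> 1
n_blank = (df['alt_names'] == '').sum()    # blank strings need their own check -> 2
n_missing = n_null + n_blank               # true number of missing entries -> 3

# With na_filter=False, empty CSV fields come back as '' rather than NaN
kept = pd.read_csv(StringIO(df.to_csv(index=False)), na_filter=False)
n_blank_after = (kept['alt_names'] == '').sum()  # NaN was written as an empty field -> 3
```

Note that round-tripping through CSV with na_filter=False collapses the NaN/blank distinction: all three missing entries come back as empty strings.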
Contents of some columns were reformatted from dictionaries to raw values to ease future delivery. The weight column held imperial and metric units together; these were split into individual columns. Only the image URL was taken from the image column. (Image URLs are now extracted from reference_image_id instead.)


	for i,val in enumerate(cat.weight):
		cat.loc[i,'weight_lb'] = val['imperial'] # extract into new columns
		cat.loc[i,'weight_kg'] = val['metric']
	cat.drop(columns = 'weight', inplace = True) # delete original column

	for i,val in enumerate(cat.image):
		if isinstance(val, dict) and 'url' in val:
			cat.loc[i,'image'] = val['url'] # image source, if available
		else:
			cat.loc[i,'image'] = '' # leave blank, if not
	

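The loops above can also be written without explicit iteration. A sketch of an equivalent, more vectorized approach (toy rows mimicking the API's nested columns, not the real data):

```python
import pandas as pd

cat = pd.DataFrame({
    'weight': [{'imperial': '7 - 10', 'metric': '3 - 5'},
               {'imperial': '8 - 16', 'metric': '4 - 7'}],
    'image': [{'url': 'https://example.com/a.jpg'}, None],
})

# json_normalize expands a list of dicts into one column per key
weight_cols = pd.json_normalize(cat['weight'].tolist())
cat[['weight_lb', 'weight_kg']] = weight_cols[['imperial', 'metric']].values
cat = cat.drop(columns='weight')

# Extract the URL where present, blank otherwise
cat['image'] = cat['image'].apply(
    lambda d: d['url'] if isinstance(d, dict) and 'url' in d else '')
```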

Data Additions

Cat Images

One missing entry in wikipedia_url was manually filled in, then the Wikipedia pages were used to fill in missing images. Note that some portions of this code are not generally applicable; given the low number of entries to process, the handling is fairly use-case specific.


	ind = cat[cat.wikipedia_url.isnull()].index
	cat.loc[ind,'name'].values[0] 
	> European Burmese 
	cat.loc[ind,'wikipedia_url'] = 'https://en.wikipedia.org/wiki/Burmese_cat'
	
	# Find Image from wikipedia page and add to image-less cats
	for i,val in enumerate(cat[cat.image == ''].index):
		URL = cat.loc[val,'wikipedia_url']
		page = re.get(URL)
		soup = BeautifulSoup(page.content, 'html.parser')
		all_links = soup.find_all('a',class_= 'image')
		if str([all_links[0]]).find('Question_book-new.svg') == -1: # One page has a question book icon for the 1st image
			link_info = str(all_links[0])
		else:
			link_info = str(all_links[1])
		fat = link_info[link_info.find('src='):link_info.find('srcset=')]

		if fat.find('.jpg') == -1: # Some extensions are written as .JPG instead of .jpg
			extension = '.JPG'
		else:
			extension = '.jpg'

		base = fat[fat.find('/thumb')+6:fat.find(extension)+4]
		image_link = f'https://upload.wikimedia.org/wikipedia/commons{base}'
		cat.loc[val,'image'] = image_link
	
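The scraping step depends on a live page and BeautifulSoup, but the core idea, grabbing the first image source and rebuilding a full-resolution commons URL, can be illustrated with only the standard library. A sketch over a sample HTML snippet (not an actual Wikipedia page):

```python
from html.parser import HTMLParser

class FirstImageSrc(HTMLParser):
    """Record the src attribute of the first <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == 'img' and self.src is None:
            self.src = dict(attrs).get('src')

sample = ('<a class="image"><img '
          'src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Cat.jpg/220px-Cat.jpg">'
          '</a>')
parser = FirstImageSrc()
parser.feed(sample)

# Mirror the slicing in the loop above: drop the '/thumb' prefix and the
# thumbnail suffix to recover the original upload path
src = parser.src
base = src[src.find('/thumb') + 6 : src.find('.jpg') + 4]
image_link = f'https://upload.wikimedia.org/wikipedia/commons{base}'
```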

Alpha-3 Country Codes

The two-character country codes are not sufficient to map the data using Plotly, which expects ISO alpha-3 codes. I found a table of country names and their codes, compared the cat DataFrame's country codes against the appropriate column of that table, then converted each country_code to its three-character equivalent. There were some mismatches with Iran (country name) and Singapore (alpha-2 code), hence the try/except/pass expressions. This code could be improved.


	countries = pd.read_csv(Path(Path.cwd(),'subfolder', 'countries.csv'), index_col = 'Country') # link to CSV file above
	
	# Adjust country_code from alpha-2 to 3
	for val in cat.index:
		CC = cat.loc[val,'country_code'] # Match alpha-2 codes, use to find alpha-3 code
		try:
			ind = countries[countries['Alpha-2 code'] == CC].index[0]
			cat.loc[val,'country_code'] = countries.loc[ind,'Alpha-3 code'].replace('"','').replace(' ','') # values have junk characters
		except IndexError: # If alpha-2 code not matched (Singapore), try matching country name.
			try:
				country = cat.loc[val,'origin']
				cat.loc[val,'country_code'] = countries.loc[country,'Alpha-3 code'].replace('"','').replace(' ','')
			except KeyError: # No match on either; leave the alpha-2 code in place
				pass
	
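One way the nested try/except could be improved is to build plain dictionaries from the lookup table and fall back explicitly. A sketch with a toy lookup table (hypothetical values; the real data comes from countries.csv):

```python
import pandas as pd

# Toy lookup table; 'SP' is a deliberately wrong alpha-2 code for Singapore
# to mimic the mismatch described above
countries = pd.DataFrame({
    'Country': ['United States', 'Singapore', 'Iran'],
    'Alpha-2 code': ['US', 'SP', 'IR'],
    'Alpha-3 code': ['USA', 'SGP', 'IRN'],
})
by_alpha2 = dict(zip(countries['Alpha-2 code'], countries['Alpha-3 code']))
by_name = dict(zip(countries['Country'], countries['Alpha-3 code']))

cat = pd.DataFrame({'origin': ['United States', 'Singapore'],
                    'country_code': ['US', 'SG']})

def to_alpha3(row):
    # Prefer the alpha-2 match, fall back to the country name, else keep as-is
    return (by_alpha2.get(row['country_code'])
            or by_name.get(row['origin'])
            or row['country_code'])

cat['country_code'] = cat.apply(to_alpha3, axis=1)
```

Dictionary lookups make each fallback explicit, so a failed match can never silently swallow an unrelated error the way a bare except can.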


Data Characterization

Column types were determined to allow calls for specific types of data. Column names and indexes for each type of data were stored as separate pandas Series. Averages were calculated, where applicable. Check out the Data Exploration pages or a Cat Data page to dig deeper into the data.

Data types identified: spans, words, binaries, and stats.


	# Replace DataFrame index to cat names. Establish pandas Series for data descriptors.
	cat.set_index('name', inplace = True)
	stats = pd.Series(dtype=object) # values will be column names, indexed by column position
	words = pd.Series(dtype=object)
	binaries = pd.Series(dtype=object)
	spans = pd.Series(dtype=object)
	
	for i,val in enumerate(cat.columns):
		if cat[val].dtype == 'object':
		if cat[val].iloc[0].replace(' ', '').replace('-','').isdigit(): # Spans " - " format
				spans.loc[i] = val
				cat.loc['CatAPI Average',val] = '' # pre-allocate as string (object), if empty then nan value (float) will cause errors
				cat[val] = cat[val].apply(lambda span_string: span_string.replace(' ','')) # get rid of spaces
				# Average calculation
				lower = []
				upper = []
				for cell in cat[val][:-1]:
					values = cell.split('-')
					lower.append(float(values[0]))
					upper.append(float(values[1]))
				lower_avg = round(np.mean(lower),1)
				upper_avg = round(np.mean(upper),1)
				cat.loc['CatAPI Average',val] = f'{lower_avg}-{upper_avg}'

			else:
				words.loc[i] = val # Words, no average

		else:
			if np.min(cat[val]) == 0: # Binaries range from 0-1
				binaries.loc[i] = val
				avg = round(np.mean(cat[val]),2)
				cat.loc['CatAPI Average',val] = avg
			else:
				stats.loc[i] = val # Stats range from 1-5
				avg = round(np.mean(cat[val]),2)
				cat.loc['CatAPI Average',val] = avg
	
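The span-averaging step above can be isolated into a small helper function, which makes it easier to test in isolation. A sketch over toy span strings (span_average is a hypothetical name, not part of the site's code):

```python
import numpy as np

def span_average(spans, ndigits=1):
    """Average the lower and upper bounds of 'low - high' span strings."""
    lower, upper = [], []
    for s in spans:
        lo, hi = s.replace(' ', '').split('-')
        lower.append(float(lo))
        upper.append(float(hi))
    return f'{round(np.mean(lower), ndigits)}-{round(np.mean(upper), ndigits)}'

avg = span_average(['7 - 10', '8 - 16', '6 - 12'])  # lower mean 7.0, upper mean 12.7
```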


Data Persistence

Pandas enables saving to many data formats. As this data is simple and static*, I decided CSV, a human-readable filetype, would be ideal. I also considered Python pickles and a simple SQLite database. For large data sets, I've found Parquet files to be a great option for pandas DataFrames (example usage).

The following code provides examples of saving and reading the main data table, cat, and a Series, stats. It also shows how a random cat and its stats may be selected.


	# Saving table examples
	cat.to_csv(Path(Path.cwd(),'subfolder','catdata.csv'))
	stats.to_csv(Path(Path.cwd(),'subfolder','stats.csv'))
	
	# Reading table examples, certain import parameters specified for future handling in Flask
	data = pd.read_csv(Path(Path.cwd(),'subfolder', 'catdata.csv'), index_col = 'name', na_filter = False) # na_filter keeps empty strings as object data types, useful later
	statsCol = pd.Series(pd.read_csv(Path(Path.cwd(),'subfolder','stats.csv'), index_col = 0)['0'], name = 'Cols')
	
	# Select random cat and its Stats (both loc and iloc methods shown)
	random_cat = pd.Series(data.index[:-1]).sample().values[0] # last entry (average) is omitted from sample
	cat_stats = data.loc[random_cat,statsCol]
	cat_stats = data.iloc[data.index.get_loc(random_cat),statsCol.index] # iloc requires integer positions
	cat_stats.name 
	> 'Bambino'
	cat_stats[0:3] 
	> adaptability       5.0
	> affection_level    5.0
	> child_friendly     4.0
	
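The loc/iloc equivalence can be checked on a toy frame. Since iloc needs integer positions, the row label is converted with Index.get_loc; a sketch with made-up values, not the real saved files:

```python
import pandas as pd

data = pd.DataFrame({'adaptability': [5, 3], 'affection_level': [5, 4]},
                    index=pd.Index(['Abyssinian', 'Bambino'], name='name'))
# Mimics statsCol as read back from CSV: values are column names,
# the index holds their integer positions
statsCol = pd.Series(['adaptability', 'affection_level'], index=[0, 1], name='Cols')

random_cat = 'Bambino'  # stands in for the sampled name

by_label = data.loc[random_cat, statsCol]                                # label-based
by_position = data.iloc[data.index.get_loc(random_cat), statsCol.index]  # position-based
```

Both selections return the same Series, named after the cat, which is why either method works for the lookup above.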

*This site is designed to be able to call, clean, and save the CatAPI data whenever desired, but I haven't found the data to be updated over time. Therefore, this process is performed once, and subsequent calls are made to the CSV files. Likewise, some graphs are generated once, then saved. If I added a refresh-data feature, specific portions of the cleaning process would need to be made more general.

