The initial data table was investigated for redundant and missing data, and certain columns were dropped outright.
The following is an abbreviated demonstration of that investigation; the large number of columns would make a complete walkthrough unwieldy. All adjustments to the cat DataFrame shown here were performed to generate the final data table.
import requests as re # note: this alias shadows the built-in re (regex) module
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from pathlib import Path # for save paths

cat = pd.DataFrame(re.get("https://api.thecatapi.com/v1/breeds").json())
# Total number of cats
cat.shape[0]
> 67
# Number of null entries per column
cat.isna().sum().sort_values(ascending=False)
> bidability 65
> cat_friendly 60
> vcahospitals_url 25
> cfa_url 24
> lap 20
> vetstreet_url 17
> alt_names 4
> image 2
> reference_image_id 2
> wikipedia_url 1
# Note that some entries are blank strings, which are not captured by "cat.isna()" or "cat.isnull()"
# Combined with the 4 null entries above, there are actually 24 cats without "alt_names" out of 67 total.
cat[cat.alt_names == ''].shape[0]
> 20
# pandas read functions have an "na_filter" parameter that controls this behavior
# country_codes is the same as country_code
cat[cat.country_codes != cat.country_code].shape[0]
> 0
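The na_filter behavior mentioned above can be seen with a tiny in-memory CSV (hypothetical rows, not the real cat data):

```python
import io
import pandas as pd

# Two hypothetical rows; the first has a blank alt_names cell
csv_text = "name,alt_names\nAbyssinian,\nAegean,Greek cat\n"

# Default behavior: blank cells become NaN and are caught by isna()
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.alt_names.isna().sum())  # 1

# na_filter=False keeps blanks as empty strings, invisible to isna()
raw_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(raw_df.alt_names.isna().sum())      # 0
print((raw_df.alt_names == '').sum())     # 1
```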
Contents of some columns were reformatted from dictionaries to raw values to simplify later delivery. The weight column held imperial and metric units together; these were split into separate columns. Only the image URL was kept from the image column. *Image URLs are now extracted from reference_image_id.
for i, val in enumerate(cat.weight):
    cat.loc[i, 'weight_lb'] = val['imperial']  # extract into new columns
    cat.loc[i, 'weight_kg'] = val['metric']
cat.drop(columns='weight', inplace=True)  # delete original column

for i, val in enumerate(cat.image):
    if isinstance(val, dict) and 'url' in val:
        cat.loc[i, 'image'] = val['url']  # image source, if available
    else:
        cat.loc[i, 'image'] = ''  # leave blank, if not
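A loop-free alternative for splitting a column of dictionaries is pandas' json_normalize, sketched here on a hypothetical two-row stand-in for the cat DataFrame:

```python
import pandas as pd

# Hypothetical weights mimicking the API's {'imperial': ..., 'metric': ...} format
df = pd.DataFrame({'weight': [{'imperial': '7 - 10', 'metric': '3 - 5'},
                              {'imperial': '8 - 12', 'metric': '4 - 6'}]})

# Expand the dict column into its own DataFrame, prefix the names, join it back
weights = pd.json_normalize(df['weight'].tolist()).add_prefix('weight_')
df = df.drop(columns='weight').join(weights)
print(df.columns.tolist())  # ['weight_imperial', 'weight_metric']
```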
One missing entry in wikipedia_url was replaced manually. The wikipedia pages were then used to fill in missing images. Note that some portions of this code are not generally applicable; given the low number of entries to process, the use cases are fairly specific.
ind = cat[cat.wikipedia_url.isnull()].index
cat.loc[ind,'name'].values[0]
> European Burmese
cat.loc[ind,'wikipedia_url'] = 'https://en.wikipedia.org/wiki/Burmese_cat'
# Find an image on each wikipedia page and add it to the image-less cats
for idx in cat[cat.image == ''].index:
    URL = cat.loc[idx, 'wikipedia_url']
    page = re.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    all_links = soup.find_all('a', class_='image')
    if str(all_links[0]).find('Question_book-new.svg') == -1:  # one page has a question-book icon as its 1st image
        link_info = str(all_links[0])
    else:
        link_info = str(all_links[1])
    fat = link_info[link_info.find('src='):link_info.find('srcset=')]
    if fat.find('.jpg') == -1:  # some extensions are written as .JPG instead of .jpg
        extension = '.JPG'
    else:
        extension = '.jpg'
    base = fat[fat.find('/thumb') + 6:fat.find(extension) + 4]  # path between /thumb and the extension
    cat.loc[idx, 'image'] = f'https://upload.wikimedia.org/wikipedia/commons{base}'
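To make the string slicing concrete, here is the same extraction applied to a hypothetical snippet of an `<a class="image">` tag (the URL and filename are made up for illustration):

```python
# Hypothetical HTML mimicking a wikipedia thumbnail link
link_info = ('<a class="image"><img src="//upload.wikimedia.org/wikipedia/'
             'commons/thumb/a/a1/Cat.jpg/220px-Cat.jpg" srcset="..."/></a>')

# Slice out the src attribute, then the path between /thumb and the extension,
# which converts the thumbnail path into the full-size image path
fat = link_info[link_info.find('src='):link_info.find('srcset=')]
base = fat[fat.find('/thumb') + 6:fat.find('.jpg') + 4]
print(f'https://upload.wikimedia.org/wikipedia/commons{base}')
# → https://upload.wikimedia.org/wikipedia/commons/a/a1/Cat.jpg
```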
The two-character country codes are not sufficient to map the data with Plotly. I found a table of country names and their codes, compared the cat DataFrame's country codes against the appropriate column of that table, and converted each cat's country_code to three characters. There were some mismatches with Iran (country name) and Singapore (alpha-2 code), hence the try/except/pass expressions. This code could be improved.
countries = pd.read_csv(Path(Path.cwd(), 'subfolder', 'countries.csv'), index_col='Country')  # link to CSV file above

# Adjust country_code from alpha-2 to alpha-3
for val in cat.index:
    CC = cat.loc[val, 'country_code']  # match alpha-2 codes, use to find alpha-3 code
    try:
        ind = countries[countries['Alpha-2 code'] == CC].index[0]
        cat.loc[val, 'country_code'] = countries.loc[ind, 'Alpha-3 code'].replace('"', '').replace(' ', '')  # values have junk characters
    except:
        try:  # if the alpha-2 code is not matched (Singapore), try matching the country name
            country = cat.loc[val, 'origin']
            cat.loc[val, 'country_code'] = countries.loc[country, 'Alpha-3 code'].replace('"', '').replace(' ', '')
        except:
            pass
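One possible improvement is to replace the per-row try/except with vectorized lookups via Series.map, falling back to a name-based match. A minimal sketch with toy stand-ins for both tables (the 'SP' code below is hypothetical, mimicking the Singapore mismatch):

```python
import pandas as pd

# Toy stand-ins for the countries table and the cat DataFrame
countries = pd.DataFrame({
    'Country': ['Egypt', 'Singapore'],
    'Alpha-2 code': ['EG', 'SG'],
    'Alpha-3 code': ['EGY', 'SGP'],
})
cat = pd.DataFrame({'origin': ['Egypt', 'Singapore'],
                    'country_code': ['EG', 'SP']})  # 'SP' will not match

by_code = countries.set_index('Alpha-2 code')['Alpha-3 code']
by_name = countries.set_index('Country')['Alpha-3 code']

# Prefer the alpha-2 match; unmatched codes fall back to matching on country name
cat['country_code'] = (cat['country_code'].map(by_code)
                       .fillna(cat['origin'].map(by_name)))
print(cat['country_code'].tolist())  # ['EGY', 'SGP']
```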
Column types were determined to allow calls for specific types of data. Column names and indexes for each type were stored as separate pandas Series, and averages were calculated where applicable. Check out the Data Exploration pages or a Cat Data page to dig deeper into the data.
# Replace DataFrame index with cat names. Establish pandas Series of data descriptors.
cat.set_index('name', inplace=True)
stats = pd.Series(dtype=int)
words = pd.Series(dtype=object)
binaries = pd.Series(dtype=int)
spans = pd.Series(dtype=int)

for i, val in enumerate(cat.columns):
    if cat[val].dtype == 'object':
        if cat[val][0].replace(' ', '').replace('-', '').isdigit():  # spans use a "low - high" format
            spans.loc[i] = val
            cat.loc['CatAPI Average', val] = ''  # pre-allocate as string (object); if left empty, the nan value (float) would cause errors
            cat[val] = cat[val].apply(lambda span_string: span_string.replace(' ', ''))  # get rid of spaces
            # Average calculation
            lower = []
            upper = []
            for cal in cat[val][:-1]:
                values = cal.split('-')
                lower.append(float(values[0]))
                upper.append(float(values[1]))
            lower_avg = round(np.mean(lower), 1)
            upper_avg = round(np.mean(upper), 1)
            cat.loc['CatAPI Average', val] = f'{lower_avg}-{upper_avg}'
        else:
            words.loc[i] = val  # words, no average
    else:
        if np.min(cat[val]) == 0:  # binaries range from 0-1
            binaries.loc[i] = val
        else:
            stats.loc[i] = val  # stats range from 1-5
        avg = round(np.mean(cat[val]), 2)
        cat.loc['CatAPI Average', val] = avg
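The span-averaging step above boils down to splitting each "low-high" string, averaging the two halves independently, and re-joining them. A compact sketch with hypothetical span strings:

```python
import numpy as np

spans = ['7-10', '8-12', '5-9']  # hypothetical "low-high" span strings

# Split each span into its two bounds, then average each bound separately
lower, upper = zip(*(map(float, s.split('-')) for s in spans))
avg_span = f'{round(np.mean(lower), 1)}-{round(np.mean(upper), 1)}'
print(avg_span)  # 6.7-10.3
```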
Pandas enables saving to many data types. As this data is simple and static*, I decided CSV, a human-readable filetype, would be ideal. I also considered python pickles and a simple sqlite database; for large data sets, I've found Parquet files to be a great option for Pandas DataFrames.
The following code provides examples of saving and reading the main data table, cat, and a series, stats. It also shows how a random cat and its stats may be selected.
# Saving table examples
cat.to_csv(Path(Path.cwd(),'subfolder','catdata.csv'))
stats.to_csv(Path(Path.cwd(),'subfolder','stats.csv'))
# Reading table examples, certain import parameters specified for future handling in Flask
data = pd.read_csv(Path(Path.cwd(),'subfolder', 'catdata.csv'), index_col = 'name', na_filter = False) # na_filter keeps empty strings as object data types, useful later
statsCol = pd.Series(pd.read_csv(Path(Path.cwd(),'subfolder','stats.csv'), index_col = 0)['0'], name = 'Cols')
# Select a random cat and its Stats (both loc and iloc methods shown)
random_cat = pd.Series(data.index[:-1]).sample().values[0] # last entry (average) is omitted from sample
cat_stats = data.loc[random_cat,statsCol]
cat_stats = data.iloc[data.index.get_loc(random_cat),statsCol.index] # iloc needs an integer position for the row as well
cat_stats.name
> 'Bambino'
cat_stats[0:3]
> adaptability 5.0
> affection_level 5.0
> child_friendly 4.0
*This site is designed to be able to call, clean, and save the CatAPI data whenever desired, but I haven't found the data to be updated over time. Therefore, this process is performed once, and subsequent calls are made to the CSV files. Likewise, some graphs are generated once, then saved. If I were to add a data-refresh feature, specific portions of the cleaning process would need to be made more general.