Data Cleaning

The objective of this section is to carry out some data cleaning procedures on the previously collected text and record data. The datasets used here are the text data collected from News API and the record data collected from Rapid API and Cars API.

Cleaning Text Data

Importing the required python packages:

Show the code

API_KEY='481b1e4a75874d2f9a23e3329031364c'

Show the code

import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Topic : Electric Vehicles

A quick overview into the intial shape of the json returned from the api call

Show the code

baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

# THIS CODE WILL NOT WORK UNLESS YOU INSERT YOUR API KEY IN THE NEXT LINE
API_KEY='481b1e4a75874d2f9a23e3329031364c'
TOPIC1='electric vehicles'

Form URL and save result

Show the code

URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC1,
            'sortBy': 'relevancy',
            'totalRequests': 1}

print(baseURL)
# print(URLpost)

#GET DATA FROM API
response = requests.get(baseURL, URLpost) #request data from the server
# print(response.url);  
response = response.json() #extract txt data from request into json

# PRETTY PRINT
# https://www.digitalocean.com/community/tutorials/python-pretty-print-json

# print(json.dumps(response, indent=2));
print(json.dumps(response['articles'][:5], indent=2))
 

# #GET TIMESTAMP FOR PULL REQUEST
from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d-H%H-M%M-S%S")

# SAVE TO FILE 
with open(timestamp+'-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4);

https://newsapi.org/v2/everything?
[
  {
    "source": {
      "id": "engadget",
      "name": "Engadget"
    },
    "author": "Kris Holt",
    "title": "Lucid EVs will be able to access Tesla's Superchargers starting in 2025",
    "description": "Lucid's electric vehicles will be able to plug into over 15,000 Tesla Superchargers in North America starting in 2025. The automaker is the latest entry in the growing list of companies pledging to support the North American Charging Standard (NACS), also kno\u2026",
    "url": "https://www.engadget.com/lucid-evs-will-be-able-to-access-teslas-superchargers-starting-in-2025-055045292.html",
    "urlToImage": "https://s.yimg.com/ny/api/res/1.2/cLamIJynTVqTuxElMIvb0g--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD04MDA-/https://s.yimg.com/os/creatr-uploaded-images/2023-11/52b603b0-7d2b-11ee-9eff-5b10c26861ec",
    "publishedAt": "2023-11-07T05:50:45Z",
    "content": "Lucid's electric vehicles will be able to plug into over 15,000 Tesla Superchargers in North America starting in 2025. The automaker is the latest entry in the growing list of companies pledging to s\u2026 [+1591 chars]"
  },
  {
    "source": {
      "id": "engadget",
      "name": "Engadget"
    },
    "author": "Mariella Moon",
    "title": "Revel is shutting down its shared electric moped service",
    "description": "Revel is leaving behind its roots and ending its (at times controversial) electric moped service New York City and San Francisco. Company CEO and co-founder Frank Reig has sent a company-wide email, viewed by TechCrunch, telling staff that \"the service has be\u2026",
    "url": "https://www.engadget.com/revel-is-shutting-down-its-shared-electric-moped-service-113046594.html",
    "urlToImage": "https://s.yimg.com/ny/api/res/1.2/L52ZFOIJv97OLl8rjENH1w--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD04MDA-/https://s.yimg.com/os/creatr-uploaded-images/2023-11/485a0370-7afc-11ee-b5f6-afb3423667dc",
    "publishedAt": "2023-11-04T11:30:46Z",
    "content": "Revel is leaving behind its roots and ending its (at times controversial) electric moped service New York City and San Francisco. Company CEO and co-founder Frank Reig has sent a company-wide email, \u2026 [+1490 chars]"
  },
  {
    "source": {
      "id": "engadget",
      "name": "Engadget"
    },
    "author": "Will Shanklin",
    "title": "EVs are way more unreliable than gas-powered cars, Consumer Reports data indicates",
    "description": "Consumer Reports has published an extensive ranking of vehicle reliability, and the results pour cold water on the dependability of EVs and plug-in hybrids. The survey says electric vehicles suffer from 79 percent more maintenance issues than gas- or diesel-p\u2026",
    "url": "https://www.engadget.com/evs-are-way-more-unreliable-than-gas-powered-cars-consumer-reports-data-indicates-212216581.html",
    "urlToImage": "https://s.yimg.com/ny/api/res/1.2/iDms4Dm4pNHUv2wjvZ3MkQ--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD03MzM-/https://s.yimg.com/os/creatr-uploaded-images/2023-11/c01cae10-8ef9-11ee-9d77-64bb3de436c7",
    "publishedAt": "2023-11-29T21:22:16Z",
    "content": "Consumer Reports has published an extensive ranking of vehicle reliability, and the results pour cold water on the dependability of EVs and plug-in hybrids. The survey says electric vehicles suffer f\u2026 [+3335 chars]"
  },
  {
    "source": {
      "id": "wired",
      "name": "Wired"
    },
    "author": "Adrienne So",
    "title": "Black Friday Deals on Electric Bikes (2023): Rad and Aventon",
    "description": "Electric bike deals are rolling for Black Friday. Pick up your new neighborhood ride from Aventon, Rad Power, Jackrabbit, or Specialized.",
    "url": "https://www.wired.com/story/best-electric-bike-deals-2023/",
    "urlToImage": "https://media.wired.com/photos/6556c91d3eaea30ab934eeea/191:100/w_1280,c_limit/Black-Friday-Cyber-Monday-Best-E-Bike-Deals.jpg",
    "publishedAt": "2023-11-18T13:00:00Z",
    "content": "We here at WIRED are big fans of bikes, electric bikes, bike accessories, and any vehicles, policies, or infrastructure that advance active transportation. Getting people more active as they go about\u2026 [+3354 chars]"
  },
  {
    "source": {
      "id": "the-verge",
      "name": "The Verge"
    },
    "author": "Andrew J. Hawkins",
    "title": "Tesla Cybertruck will usher in a new \u2018Powershare\u2019 bidirectional charging feature",
    "description": "Tesla\u2019s Cybertruck will be the company\u2019s first vehicle to feature vehicle-to-load, or bidirectional charging. That allows customers to charge equipment, another EV, or even power their whole home from their Cybertruck.",
    "url": "https://www.theverge.com/2023/11/30/23983226/tesla-cybertruck-powershare-bidirectional-vehicle-to-load",
    "urlToImage": "https://cdn.vox-cdn.com/thumbor/b8pqGPSF6FhbjfA_Uv-DGznEBR4=/0x0:2226x948/1200x628/filters:focal(1113x474:1114x475)/cdn.vox-cdn.com/uploads/chorus_asset/file/25123625/Screen_Shot_2023_11_30_at_4.17.14_PM.png",
    "publishedAt": "2023-11-30T21:47:26Z",
    "content": "Tesla Cybertruck will usher in a new Powershare bidirectional charging feature\r\nTesla Cybertruck will usher in a new Powershare bidirectional charging feature\r\n / The EV maker finally jumps on the ve\u2026 [+2497 chars]"
  }
]

Utility function

Function to clean strings

Show the code

def string_cleaner(input_string):
    try: 
        out=re.sub(r"""
                    [,.;@#?!&$-]+  # Accept one or more copies of punctuation
                    \ *           # plus zero or more copies of a space,
                    """,
                    " ",          # and replace it with a single space
                    input_string, flags=re.VERBOSE)

        #REPLACE SELECT CHARACTERS WITH NOTHING
        out = re.sub('[’.]+', '', input_string)

        #ELIMINATE DUPLICATE WHITESPACES USING WILDCARDS
        out = re.sub(r'\s+', ' ', out)

        #CONVERT TO LOWER CASE
        out=out.lower()
    except:
        print("ERROR")
        out=''
    return out

Clean JSON

clean data and make a list of lists

Show the code

import json
import re

# Assuming `response` is your JSON response containing the articles
article_list = response['articles']  # list of dictionaries for each article
article_keys = article_list[0].keys()
print("AVAILABLE KEYS:")
print(article_keys)

# Set how many articles to preview
number_of_articles_to_preview = 5  # Adjust as needed
verbose = True  # Set to False to suppress verbose output

def string_cleaner(input_string):
    # Define your string cleaning function here
    return input_string

electric_vehicle_cleaned_data = []

for index, article in enumerate(article_list):
    tmp = []

    # Print a preview for a limited number of articles
    if index < number_of_articles_to_preview and verbose:
        print(f"Article #{index} Preview:")
        print(f"Title: {article.get('title', 'No Title')}")
        # Add any other key information you want in the preview
        print("------------------------------------------")

    for key in article_keys:
        if key == 'source':
            src = string_cleaner(article[key]['name'])
            tmp.append(src)

        if key == 'author':
            author = string_cleaner(article[key]) if article[key] else 'NA'  # Ensure author is not None
            # ERROR CHECK (SOMETIMES AUTHOR IS SAME AS PUBLICATION)
            if author != 'NA' and src in author: 
                print("AUTHOR ERROR:", author)
                author = 'NA'
            tmp.append(author)

        if key == 'title':
            tmp.append(string_cleaner(article[key]))

        if key == 'description':
            tmp.append(string_cleaner(article[key]))

        if key == 'content':
            tmp.append(string_cleaner(article[key]))

        if key == 'publishedAt':
            # DEFINE DATA PATTERN FOR RE TO CHECK
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date = article[key]
            if not ref.match(date):
                print("DATE ERROR:", date)
                date = "NA"
            tmp.append(date)

    electric_vehicle_cleaned_data.append(tmp)

AVAILABLE KEYS:
dict_keys(['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content'])
Article #0 Preview:
Title: Lucid EVs will be able to access Tesla's Superchargers starting in 2025
------------------------------------------
Article #1 Preview:
Title: Revel is shutting down its shared electric moped service
------------------------------------------
Article #2 Preview:
Title: EVs are way more unreliable than gas-powered cars, Consumer Reports data indicates
------------------------------------------
Article #3 Preview:
Title: Black Friday Deals on Electric Bikes (2023): Rad and Aventon
------------------------------------------
Article #4 Preview:
Title: Tesla Cybertruck will usher in a new ‘Powershare’ bidirectional charging feature
------------------------------------------

Convert to Dataframe

Show the code

df = pd.DataFrame(electric_vehicle_cleaned_data)
print(df)
df.to_csv('electric_vehicle_cleaned.csv', index=False) 
index_label=['title','src','author','date','description']

                          0                         1  \
0                  Engadget                 Kris Holt   
1                  Engadget             Mariella Moon   
2                  Engadget             Will Shanklin   
3                     Wired               Adrienne So   
4                 The Verge         Andrew J. Hawkins   
..                      ...                       ...   
95                 Autoblog                 Bloomberg   
96  Futurity: Research News  Jim Erickson-U. Michigan   
97               Core77.com                  Rain Noe   
98           Digital Trends       Christian de Looper   
99             The Next Web            Linnea Ahlgren   

                                                    2  \
0   Lucid EVs will be able to access Tesla's Super...   
1   Revel is shutting down its shared electric mop...   
2   EVs are way more unreliable than gas-powered c...   
3   Black Friday Deals on Electric Bikes (2023): R...   
4   Tesla Cybertruck will usher in a new ‘Powersha...   
..                                                ...   
95  A surprising player gets into the EV battery g...   
96  To meet emissions goal, decarbonize light-duty...   
97  An Electric Motorcycle Built Around a Cargo Ca...   
98  Tesla’s EV plug is great, but smoother payment...   
99  Want engineering superpowers? This GenAI start...   

                                                    3                     4  \
0   Lucid's electric vehicles will be able to plug...  2023-11-07T05:50:45Z   
1   Revel is leaving behind its roots and ending i...  2023-11-04T11:30:46Z   
2   Consumer Reports has published an extensive ra...  2023-11-29T21:22:16Z   
3   Electric bike deals are rolling for Black Frid...  2023-11-18T13:00:00Z   
4   Tesla’s Cybertruck will be the company’s first...  2023-11-30T21:47:26Z   
..                                                ...                   ...   
95  Filed under:\n Green,Electric\n Continue readi...  2023-11-12T15:00:00Z   
96  "Transportation is the highest emitting sector...  2023-11-08T15:38:08Z   
97  With the help of imaginative design, EV techno...  2023-11-17T14:00:00Z   
98  NACS is becoming the standard connector for ch...  2023-12-03T19:00:18Z   
99  Say the application that has not been attribut...  2023-11-27T16:45:11Z   

                                                    5  
0   Lucid's electric vehicles will be able to plug...  
1   Revel is leaving behind its roots and ending i...  
2   Consumer Reports has published an extensive ra...  
3   We here at WIRED are big fans of bikes, electr...  
4   Tesla Cybertruck will usher in a new Powershar...  
..                                                ...  
95  Khalid Al-Falih (AP)\r\nThe worlds biggest oil...  
96  A new study reveals a path toward reducing Uni...  
97  With the help of imaginative design, EV techno...  
98  Tesla\r\nIt’s finally happening. Up until now,...  
99  Say the application that has not been attribut...  

[100 rows x 6 columns]

Show the code

electric_vehicle_df = pd.read_csv("electric_vehicle_cleaned.csv")
rename_map = {
    '0': 'title',
    '2': 'description'
}

electric_vehicle_df.rename(columns=rename_map, inplace=True)

Show the code

cols_to_keep = rename_map.values()
electric_vehicle_df = electric_vehicle_df[cols_to_keep]

Show the code

electric_vehicle_df

	title	description
0	Engadget	Lucid EVs will be able to access Tesla's Super...
1	Engadget	Revel is shutting down its shared electric mop...
2	Engadget	EVs are way more unreliable than gas-powered c...
3	Wired	Black Friday Deals on Electric Bikes (2023): R...
4	The Verge	Tesla Cybertruck will usher in a new ‘Powersha...
...	...	...
95	Autoblog	A surprising player gets into the EV battery g...
96	Futurity: Research News	To meet emissions goal, decarbonize light-duty...
97	Core77.com	An Electric Motorcycle Built Around a Cargo Ca...
98	Digital Trends	Tesla’s EV plug is great, but smoother payment...
99	The Next Web	Want engineering superpowers? This GenAI start...

100 rows × 2 columns

Show the code

ev_text = str(electric_vehicle_df["description"])

Show the code

print(ev_text)

0     Lucid EVs will be able to access Tesla's Super...
1     Revel is shutting down its shared electric mop...
2     EVs are way more unreliable than gas-powered c...
3     Black Friday Deals on Electric Bikes (2023): R...
4     Tesla Cybertruck will usher in a new ‘Powersha...
                            ...                        
95    A surprising player gets into the EV battery g...
96    To meet emissions goal, decarbonize light-duty...
97    An Electric Motorcycle Built Around a Cargo Ca...
98    Tesla’s EV plug is great, but smoother payment...
99    Want engineering superpowers? This GenAI start...
Name: description, Length: 100, dtype: object

Word Cloud Visualization

Show the code

# MODIFIED FROM 
# https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5
def generate_word_cloud(my_text):
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    # exit()
    # Import package
    # Define a function to plot word cloud
    def plot_cloud(wordcloud):
        # Set figure size
        plt.figure(figsize=(40, 30))
        # Display image
        plt.imshow(wordcloud) 
        # No axis details
        plt.axis("off");

    # Generate word cloud
    wordcloud = WordCloud(
        width = 3000,
        height = 2000, 
        random_state=1, 
        background_color='salmon', 
        colormap='Pastel1', 
        collocations=False,
        stopwords = STOPWORDS).generate(my_text)
    plot_cloud(wordcloud)
    plt.show()

# text='The field of machine learning is typically divided into three fundamental sub-paradigms. These include supervised learning, unsupervised learning, and reinforcement learning (RL). The discipline of reinforcement learning focuses on how intelligent agents learn to perform actions, inside a specified environment, to maximize  a cumulative reward function. Over the past several decades, there has been a push to incorporate concepts from the field of deep-learning into the agents used in RL algorithms. This has spawned the field of Deep reinforcement learning. To date, the field of deep RL has yielded stunning results in a wide range of technological applications. These include, but are not limited to, self-driving cars, autonomous game play, robotics, trading and finance, and Natural Language Processing. This course will begin with an introduction to the fundamentals of traditional, i.e. non-deep, reinforcement learning. After reviewing fundamental deep learning topics the course will transition to deep RL by incorporating artificial neural networks into the models. Topics include Markov Decision Processes, Multi-armed Bandits, Monte Carlo Methods, Temporal Difference Learning, Function Approximation, Deep Neural Networks, Actor-Critic, Deep Q-Learning, Policy Gradient Methods, and connections to Psychology and to Neuroscience.'
generate_word_cloud(ev_text)

Record Data Cleaning

Data from Rapid API:

With this dataset we carry out the following data cleaning procedures: - Converting the json file into a dataframe. - Clean the column names by renaming them and stripping any whitespaces. - Drop unneccessary columns - Save the dataframe

Show the code

import json
import pandas as pd

# Load JSON file
with open('output.json', 'r') as f:
    data = json.load(f)

# Create a Pandas DataFrame from the entire JSON data
df = pd.json_normalize(data)

# Renaming columns and stripping any spaces
df.rename(columns=lambda x: x.strip(), inplace=True)
df=df.rename(columns = {'_id':'id'})
df=df.rename(columns = {'VIN (1-10)':'VIN'})

# Drop columns that are not needed:
# df = df.drop(['Base MSRP', 'Legislative District', 'DOL Vehicle ID', 'Vehicle Location', 'Electric Utility', '2020 Census Tract'], axis=1)

# Print the DataFrame
display(df)

# saving the dataframe
df.to_csv('ev-output.csv')

	id	VIN	County	City	State	Postal Code	Model Year	Make	Model	Electric Vehicle Type	Clean Alternative Fuel Vehicle (CAFV) Eligibility	Electric Range	Base MSRP	Legislative District	DOL Vehicle ID	Vehicle Location	Electric Utility	2020 Census Tract
0	64b8150c3f35c83376286f21	1N4AZ0CP5D	Kitsap	Bremerton	WA	98310	2013	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	75	0	23	214384901	POINT (-122.61136499999998 47.575195000000065)	PUGET SOUND ENERGY INC	53035080400
1	64b8150c3f35c83376286f22	1N4AZ1CP8K	Kitsap	Port Orchard	WA	98366	2019	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	150	0	26	271008636	POINT (-122.63926499999997 47.53730000000007)	PUGET SOUND ENERGY INC	53035092300
2	64b8150c3f35c83376286f23	5YJXCAE28L	King	Seattle	WA	98199	2020	TESLA	MODEL X	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	293	0	36	8781552	POINT (-122.394185 47.63919500000003)	CITY OF SEATTLE - (WA)\|CITY OF TACOMA - (WA)	53033005600
3	64b8150c3f35c83376286f24	SADHC2S1XK	Thurston	Olympia	WA	98503	2019	JAGUAR	I-PACE	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	234	0	2	8308492	POINT (-122.8285 47.03646)	PUGET SOUND ENERGY INC	53067011628
4	64b8150c3f35c83376286f25	JN1AZ0CP9B	Snohomish	Everett	WA	98204	2011	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	73	0	21	245524527	POINT (-122.24128499999995 47.91088000000008)	PUGET SOUND ENERGY INC	53061041901
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	64b8150c3f35c83376286f80	1N4AZ0CP8G	King	Bellevue	WA	98006	2016	NISSAN	LEAF	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	84	0	41	35325	POINT (-122.16936999999996 47.571015000000045)	PUGET SOUND ENERGY INC\|\|CITY OF TACOMA - (WA)	53033025007
96	64b8150c3f35c83376286f81	5UXKT0C53J	Yakima	Zillah	WA	98953	2018	BMW	X5	Plug-in Hybrid Electric Vehicle (PHEV)	Not eligible due to low battery range	13	0	15	153996167	POINT (-120.26033999999999 46.40493500000008)	PACIFICORP	53077002201
97	64b8150c3f35c83376286f82	5YJSA1H23F	Snohomish	Lynnwood	WA	98036	2015	TESLA	MODEL S	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	208	0	1	7821215	POINT (-122.31667499999998 47.81936500000006)	PUGET SOUND ENERGY INC	53061051929
98	64b8150c3f35c83376286f83	5YJSA1E21J	Skagit	Anacortes	WA	98221	2018	TESLA	MODEL S	Battery Electric Vehicle (BEV)	Clean Alternative Fuel Vehicle Eligible	249	0	40	285304512	POINT (-122.61530499999998 48.50127500000008)	PUGET SOUND ENERGY INC	53057940600
99	64b8150c3f35c83376286f84	1G6RL1E40G	Thurston	Rochester	WA	98579	2016	CADILLAC	ELR	Plug-in Hybrid Electric Vehicle (PHEV)	Clean Alternative Fuel Vehicle Eligible	40	0	35	161629142	POINT (-123.09574999999995 46.82114000000007)	PUGET SOUND ENERGY INC	53067012610

100 rows × 18 columns

Data from Cars API:

Show the code

record_data = pd.read_csv('cars-data.csv')

record_data.head()

	Unnamed: 0	city_mpg	class	combination_mpg	cylinders	displacement	drive	fuel_type	highway_mpg	make	model	transmission	year
0	0	18	midsize car	21	4.0	2.2	fwd	gas	26	toyota	Camry	a	1993
1	1	19	midsize car	22	4.0	2.2	fwd	gas	27	toyota	Camry	m	1993
2	2	16	midsize car	19	6.0	3.0	fwd	gas	22	toyota	Camry	a	1993
3	3	16	midsize car	18	6.0	3.0	fwd	gas	22	toyota	Camry	m	1993
4	4	18	midsize-large station wagon	21	4.0	2.2	fwd	gas	26	toyota	Camry	a	1993

Look for NaN values

Show the code

nan_count = record_data.isna().sum()

print(nan_count)

Unnamed: 0           0
city_mpg             0
class                0
combination_mpg      0
cylinders          124
displacement       124
drive                8
fuel_type            0
highway_mpg          0
make                 0
model                0
transmission         0
year                 0
dtype: int64

Double check datatypes

Show the code

record_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 719 entries, 0 to 718
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       719 non-null    int64  
 1   city_mpg         719 non-null    int64  
 2   class            719 non-null    object 
 3   combination_mpg  719 non-null    int64  
 4   cylinders        595 non-null    float64
 5   displacement     595 non-null    float64
 6   drive            711 non-null    object 
 7   fuel_type        719 non-null    object 
 8   highway_mpg      719 non-null    int64  
 9   make             719 non-null    object 
 10  model            719 non-null    object 
 11  transmission     719 non-null    object 
 12  year             719 non-null    int64  
dtypes: float64(2), int64(5), object(6)
memory usage: 73.2+ KB

Convert to desired datatype

Show the code

# Convert all 'object' type columns to 'string'
for col in record_data.select_dtypes(include=['object']).columns:
    record_data[col] = record_data[col].astype('string')

# Verify the changes
record_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 719 entries, 0 to 718
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       719 non-null    int64  
 1   city_mpg         719 non-null    int64  
 2   class            719 non-null    string 
 3   combination_mpg  719 non-null    int64  
 4   cylinders        595 non-null    float64
 5   displacement     595 non-null    float64
 6   drive            711 non-null    string 
 7   fuel_type        719 non-null    string 
 8   highway_mpg      719 non-null    int64  
 9   make             719 non-null    string 
 10  model            719 non-null    string 
 11  transmission     719 non-null    string 
 12  year             719 non-null    int64  
dtypes: float64(2), int64(5), string(6)
memory usage: 73.2 KB

Drop unnecessary columns

Show the code

# Dropping non-numerical and unnecessary columns
record_data = record_data.drop(columns=['Unnamed: 0'])

Replace NaN values

Show the code

# Replace continuous missing values with mean of the column. check for Nan values again.

cols = ['displacement', 'cylinders']
record_data[cols] = record_data[cols].fillna(record_data[cols].mean())

nan_count = record_data.isna().sum()
print(nan_count)

city_mpg           0
class              0
combination_mpg    0
cylinders          0
displacement       0
drive              8
fuel_type          0
highway_mpg        0
make               0
model              0
transmission       0
year               0
dtype: int64

Show the code

# Replace categorical missing values with mode of the column. check for Nan values again.

record_data['drive'] = record_data['drive'].fillna(record_data['drive'].mode().iloc[0])

nan_count = record_data.isna().sum()
print(nan_count)

city_mpg           0
class              0
combination_mpg    0
cylinders          0
displacement       0
drive              0
fuel_type          0
highway_mpg        0
make               0
model              0
transmission       0
year               0
dtype: int64

Resources

https://jfh.georgetown.domains/dsan5000/slides-and-labs/_site/content/labs/code-demos/API-wikipedia/wikipedia-api.html
https://jfh.georgetown.domains/dsan5000/slides-and-labs/_site/content/labs/code-demos/API-newapi/news-api.html