Module 1: Data Collection

Collecting Data Using APIs¶

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
In [2]:
import requests
import os 
from PIL import Image
from IPython.display import IFrame
import pandas as pd
import json
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup  # this module helps with web scraping
In [3]:
url='https://www.ibm.com/'
r=requests.get(url)

Request¶

In [4]:
# the status of the request
r.status_code
Out[4]:
200
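
A status code of 200 means the request succeeded. As a rough guide to the other ranges, the sketch below classifies a code with the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this illustration, and `HTTPStatus(code)` raises `ValueError` for codes it does not know.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"

print(describe_status(200))
print(describe_status(404))
```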
In [5]:
# view the request headers
r.request.headers
Out[5]:
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': '_abck=000D7EA7D7226CDF1765622CD2AC1930~-1~YAAQCW4sF0LEcTeKAQAAmBikOApQl6zj/G3FD9K7kO+BqT1Dz+/NwDzEfFNXlwN6l7IHECnTO2dheHwLEqAsaQJPl8iK45pMqcMRsQALaLSA6+tDgYK4C6tTh5fssEWEIA9NiKPYmmgKQSI6/C7usiXcEO+foYZfmsan+d+DT2VqAUYwb2G2xDP3tQS/74oBzyiFWQWGjpvkADTjBI0umnqo7Ub79Omm+ui2EMUJ6SjT8FC+rsNkfiyGeDDX0HNw88EcBzPSQMsJOhf3G1osSQtnjdTxFbL6XiSMle4HTDsTJcfhMvdVRjXYq5fKW1O2CFFQGF4OnjNo3M12Ay2zAQTN5hEklzePkwQTyiwnNJ4/NIyXCmY=~-1~-1~-1; bm_sz=A4640AAB85E461EC3B0C237E9F551D69~YAAQCW4sF0PEcTeKAQAAmBikOBSDMfGrsdxKsBQv9fkHmJArxYMH3Vfh8XaN6JQpS2G02pUGPqsYaqkOXxq1Tx2mo/LRGmP5meAPIoarL0UvSQE6MJIY/4BdRpMqBpokiFLfVkSJ05s+eexffexHpU7YF2AEfaoe02JIqoG7705Qk5KumlXmsTzhYobJv+MEilTZYq58oOZJpGMBwHcg3pJoXk9WG3+vPirDXb/B1nDqywceXcB+PsKxr95eBjX2lncBFSPYDT6aIbIHykH3SENNr5JEbjVmg0SfV2G3w28=~3354681~4473655'}

Response¶

Text¶

In [6]:
# view the HTTP response headers
header=r.headers
print(header)
{'Accept-Ranges': 'bytes', 'Content-Type': 'text/html', 'ETag': '"81cf3b5dda132f3aeeaf7cf74c998296:1692916622.064576"', 'Last-Modified': 'Thu, 24 Aug 2023 18:38:39 GMT', 'Server': 'AkamaiNetStorage', 'Cache-Control': 'max-age=303', 'Expires': 'Sun, 27 Aug 2023 20:21:35 GMT', 'X-Akamai-Transformed': '9 20630 0 pmb=mTOE,2', 'Content-Encoding': 'gzip', 'Date': 'Sun, 27 Aug 2023 20:16:32 GMT', 'Content-Length': '20823', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'x-content-type-options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Strict-Transport-Security': 'max-age=31536000'}
In [7]:
# obtain the date and time at which the server generated the response
header['date']
Out[7]:
'Sun, 27 Aug 2023 20:16:32 GMT'
In [8]:
# Content-Type indicates the type of data
header['Content-Type']
Out[8]:
'text/html'
In [9]:
r.text[0:100]
Out[9]:
'<!DOCTYPE html><html lang="en-US"><head><meta name="viewport" content="width=device-width"/><meta ch'

Image¶

In [10]:
# Use single quotation marks for defining string
url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/module%201/images/IDSNlogo.png'

r=requests.get(url)
r.headers['Content-Type']
Out[10]:
'image/png'
In [11]:
path=os.path.join(os.getcwd(),'image.png')
with open(path,'wb') as f:
    f.write(r.content)

# view the image:
Image.open(path) 
Out[11]:

JSON¶

In [12]:
# Pass parameters with a GET request to modify the results of your query; they are appended to the URL as a query string
url_get='http://httpbin.org/get'
payload={"name":"Joseph","ID":"123"}
r=requests.get(url_get,params=payload)
r.url
Out[12]:
'http://httpbin.org/get?name=Joseph&ID=123'
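
The URL above is the base URL followed by `?` and the URL-encoded parameters. The sketch below rebuilds the same string with the standard library's `urllib.parse`, so it runs without a network connection. It approximates what happens for this simple payload and is not a description of how `requests` builds URLs internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://httpbin.org/get"
payload = {"name": "Joseph", "ID": "123"}

# Build the query string from the payload and append it to the base URL
full_url = base_url + "?" + urlencode(payload)
print(full_url)

# Going the other way: recover the parameters from the URL
recovered = parse_qs(urlsplit(full_url).query)
print(recovered)
```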
In [13]:
r.headers['Content-Type']
Out[13]:
'application/json'
In [14]:
r.json()
Out[14]:
{'args': {'ID': '123', 'name': 'Joseph'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.28.1',
  'X-Amzn-Trace-Id': 'Root=1-64ebaf23-22536565710040f83586776f'},
 'origin': '71.105.236.170',
 'url': 'http://httpbin.org/get?name=Joseph&ID=123'}
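
The three responses above (text, image, JSON) differ in their `Content-Type` header, which tells the client how to interpret the body. A minimal sketch of dispatching on that header follows. The helper `parse_body` is hypothetical, and it assumes text and JSON bodies are UTF-8 encoded, which a real response may not be.

```python
import json

def parse_body(content_type, body):
    """Decode raw response bytes according to the Content-Type header."""
    # Drop parameters such as '; charset=utf-8' and normalise case
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    if media_type.startswith("text/"):
        return body.decode("utf-8")
    # Images and other binary types are returned untouched
    return body

print(parse_body("application/json", b'{"ID": "123"}'))
print(parse_body("text/html; charset=utf-8", b"<!DOCTYPE html>"))
print(type(parse_body("image/png", b"\x89PNG")))
```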

Post Requests¶

Like a GET request, a POST request sends data to a server, but it carries the data in the request body rather than in the URL.

In [15]:
url_post='http://httpbin.org/post'
r_post=requests.post(url_post,data=payload)

r_post.json()['form']
Out[15]:
{'ID': '123', 'name': 'Joseph'}

Comparing the URLs of the GET and POST requests, we see that the POST URL carries no name/value pairs.

In [16]:
print("POST request URL:",r_post.url )
print("GET request URL:",r.url)
POST request URL: http://httpbin.org/post
GET request URL: http://httpbin.org/get?name=Joseph&ID=123

Comparing the POST and GET request bodies, we see only the POST request has a body:

In [17]:
print("POST request body:",r_post.request.body)
print("GET request body:",r.request.body)
POST request body: name=Joseph&ID=123
GET request body: None
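
The same contrast can be reproduced offline. The sketch below only constructs request objects with the standard library's `urllib.request.Request` and never sends them. It is an illustration of where the data lives, not a replacement for the `requests` calls above.

```python
from urllib.parse import urlencode
from urllib.request import Request

payload = {"name": "Joseph", "ID": "123"}
encoded = urlencode(payload)

# GET: the data travels in the URL, and there is no body
get_req = Request("http://httpbin.org/get?" + encoded)

# POST: the URL stays bare, and the same data travels in the body as bytes
post_req = Request("http://httpbin.org/post", data=encoded.encode("ascii"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```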

Exercise: People on the International Space Station (ISS)¶

The API at http://api.open-notify.org/astros.json returns, in JSON format, the people currently in space and the craft each is aboard. As the output below shows, this covers the ISS and other craft such as Tiangong.
You can read more about this API at http://open-notify.org/Open-Notify-API/People-In-Space/

In [18]:
api_url = "http://api.open-notify.org/astros.json" 
response = requests.get(api_url)
response.raise_for_status()  # stop on an HTTP error status instead of continuing without data
data = response.json()       # parse the JSON body into a Python dict called data

print(data)   # print the data just to check the output or for debugging
{'number': 10, 'people': [{'name': 'Sergey Prokopyev', 'craft': 'ISS'}, {'name': 'Dmitry Petelin', 'craft': 'ISS'}, {'name': 'Frank Rubio', 'craft': 'ISS'}, {'name': 'Stephen Bowen', 'craft': 'ISS'}, {'name': 'Warren Hoburg', 'craft': 'ISS'}, {'name': 'Sultan Alneyadi', 'craft': 'ISS'}, {'name': 'Andrey Fedyaev', 'craft': 'ISS'}, {'name': 'Jing Haiping', 'craft': 'Tiangong'}, {'name': 'Gui Haichow', 'craft': 'Tiangong'}, {'name': 'Zhu Yangzhu', 'craft': 'Tiangong'}], 'message': 'success'}

The number of people currently in space

In [19]:
print(data.get('number'))
10

The name of each astronaut

In [20]:
astronauts = data.get('people')
print("There are {} astronauts in space".format(len(astronauts)))
print("And their names are:")
for astronaut in astronauts:
    print(astronaut.get('name'))
There are 10 astronauts in space
And their names are:
Sergey Prokopyev
Dmitry Petelin
Frank Rubio
Stephen Bowen
Warren Hoburg
Sultan Alneyadi
Andrey Fedyaev
Jing Haiping
Gui Haichow
Zhu Yangzhu
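
Each entry in `people` also carries a `craft` field, so the total of 10 can be broken down by spacecraft. The sketch below hard-codes the list from the response printed above so that it runs without calling the API.

```python
from collections import Counter

# The `people` list copied from the response printed above
people = [
    {"name": "Sergey Prokopyev", "craft": "ISS"},
    {"name": "Dmitry Petelin", "craft": "ISS"},
    {"name": "Frank Rubio", "craft": "ISS"},
    {"name": "Stephen Bowen", "craft": "ISS"},
    {"name": "Warren Hoburg", "craft": "ISS"},
    {"name": "Sultan Alneyadi", "craft": "ISS"},
    {"name": "Andrey Fedyaev", "craft": "ISS"},
    {"name": "Jing Haiping", "craft": "Tiangong"},
    {"name": "Gui Haichow", "craft": "Tiangong"},
    {"name": "Zhu Yangzhu", "craft": "Tiangong"},
]

# Count people per craft
per_craft = Counter(person["craft"] for person in people)
print(per_craft)

# Keep only the names of those aboard the ISS
iss_names = [person["name"] for person in people if person["craft"] == "ISS"]
print(len(iss_names), "people on the ISS")
```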

Exercise: Number of jobs currently open¶

Set up the data source for retrieval via API

  • %run "./test API server.ipynb" alternative: link
  • You can also view the json file contents from the following json URL.
The keys in the json are¶
  • Job Title
  • Job Experience Required
  • Key Skills
  • Role Category
  • Location
  • Functional Area
  • Industry
  • Role

How many job postings exist for the Python programming language?

In [21]:
api_url="http://127.0.0.1:5000/data"
def get_number_of_jobs_T(technology):
    payload = {"Key Skills":technology}
    response = requests.get(api_url,params=payload)
    response.raise_for_status()  # raise on an HTTP error rather than fall through with `data` undefined
    data = response.json()
    number_of_jobs = len(data)
    return technology,number_of_jobs

get_number_of_jobs_T("Python")
Out[21]:
('Python', 1173)

How many job postings are there for the following locations:

  • Los Angeles
  • New York
  • San Francisco
  • Washington DC
  • Seattle
  • Austin
  • Detroit
In [22]:
# How many job postings exist for each location?
def get_number_of_jobs_L(location):
    payload = {"Location":location}
    response = requests.get(api_url,params=payload)
    response.raise_for_status()  # raise on an HTTP error rather than fall through with `data` undefined
    data = response.json()
    number_of_jobs = len(data)
    return location,number_of_jobs

L = ['Los Angeles', 'New York', 'San Francisco', 'Washington DC', 'Seattle']
for i in L:
    posting_count = get_number_of_jobs_L(i)
    print(posting_count)
('Los Angeles', 640)
('New York', 3226)
('San Francisco', 435)
('Washington DC', 5316)
('Seattle', 3375)
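
Both functions above rely on the local server to do the filtering and simply count what comes back. How that server matches a value is not shown here, so the sketch below is only one possible equivalent: counting matches locally over already-downloaded records, using a case-insensitive substring test. The helper `count_matching` and the three sample records are hypothetical and are not real postings.

```python
def count_matching(records, field, value):
    """Count records whose `field` contains `value` (case-insensitive)."""
    needle = value.lower()
    return sum(1 for rec in records if needle in str(rec.get(field, "")).lower())

# Hypothetical records using the keys listed earlier; not real postings
sample_jobs = [
    {"Job Title": "Data Engineer", "Key Skills": "Python|SQL", "Location": "Seattle"},
    {"Job Title": "Web Developer", "Key Skills": "JavaScript|CSS", "Location": "New York"},
    {"Job Title": "ML Engineer", "Key Skills": "Python|TensorFlow", "Location": "New York"},
]

print(count_matching(sample_jobs, "Key Skills", "python"))
print(count_matching(sample_jobs, "Location", "New York"))
```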

Collecting Data Using Webscraping¶

In [23]:
url = "http://www.ibm.com"

# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text 

soup = BeautifulSoup(data,"html5lib")  # create a soup object using the variable 'data' and the class BeautifulSoup

Scrape all links

In [24]:
for link in soup.find_all('a'):  # in HTML, an anchor (link) is represented by the <a> tag
    print(link.get('href'))
https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ceo-generative-ai/application-modernization/
https://www.ibm.com/community/ibm-techxchange-conference
https://www.ibm.com/products/watsonx-ai
https://www.ibm.com/products/watsonx-data
https://www.ibm.com/products/spss-statistics/pricing
https://www.ibm.com/sports/usopen
https://www.ibm.com/cloud?lnk=flatitem
https://www.ibm.com/products
https://www.ibm.com/consulting
https://www.ibm.com/about
https://www.ibm.com/
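
Pages often mix absolute links with relative ones, and some `<a>` tags have no `href` at all, in which case `link.get('href')` returns `None`. The sketch below shows one way to handle both cases. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and the HTML snippet and the class name `LinkCollector` are made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.links.append(urljoin(self.base_url, href))

# Hypothetical snippet standing in for a downloaded page
html = (
    '<a href="/products">Products</a> '
    '<a href="https://www.ibm.com/about">About</a> '
    '<a>No link</a>'
)
collector = LinkCollector("https://www.ibm.com/")
collector.feed(html)
print(collector.links)
```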

Scrape all images

In [25]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))
https://1.dam.s81c.com/p/0c627169442d5243/ibm_watsonx_data_closeup_still_4k.jpg.global.sr_1x1.jpg
https://1.dam.s81c.com/p/0c3ce2dfcccd1f24/watsonx-data-square.jpg
https://1.dam.s81c.com/p/0c3ce2dfcccd1f25/watsonx-ai-square.jpg
https://1.dam.s81c.com/p/0b5258b292cc8c3c/ibm-SPSS-home-card.png.global.xs_1x1.png
https://1.dam.s81c.com/p/0c9c5faa18c5c7c0/0c6278d221ada9b-1230810_ibm_us_open_2023_leadspace.jpg
https://1.dam.s81c.com/p/0aac9cf57bcbf324/dotcom-1-overview.jpg

Scrape data from html tables

In [26]:
# The URL below points to a page with an HTML table of color names and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

soup = BeautifulSoup(data,"html5lib")

# find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag

# get all rows from the table
for row in table.find_all('tr'):  # a table row is represented by the <tr> tag
    # get all cells in this row
    cols = row.find_all('td')  # a table cell is represented by the <td> tag
    color_name = cols[2].getText()  # the third cell holds the color name
    color_code = cols[3].getText()  # the fourth cell holds the hex color code
    print("{}--->{}".format(color_name,color_code))
Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
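
Each scraped code has the form `#RRGGBB`, that is, three two-digit hexadecimal numbers. The sketch below converts a few of the pairs printed above into integer RGB tuples. The helper `hex_to_rgb` is an illustrative name, and the header row (`Hex Code#RRGGBB`) would have to be skipped first because it is not a valid code.

```python
def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (R, G, B) tuple of integers in 0-255."""
    digits = hex_code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_code!r}")
    # Read two hex digits at a time: red, green, blue
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

# A few (name, code) pairs copied from the output above
scraped = [("lightsalmon", "#FFA07A"), ("gold", "#FFD700"), ("lime", "#00FF00")]
colors = {name: hex_to_rgb(code) for name, code in scraped}
print(colors)
```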

Module 2: Data Wrangling

About the dataset¶

Stack Overflow, a popular website for developers, conducted an online survey of software professionals across the world. The survey data was later open sourced by Stack Overflow. The actual data set has around 90,000 responses.

The dataset you are going to use in this assignment comes from the following source: https://stackoverflow.blog/2019/04/09/the-2019-stack-overflow-developer-survey-results-are-in/ under the Open Database License (ODbL).

You will be given a subset of the original data set in this capstone project. You will explore, analyze, and visualize this dataset and present your analysis.

Note: This randomized subset contains around 1/10th of the original data set. Any conclusions you draw after analyzing this subset may not reflect the real-world scenario.

The dataset is available as a .csv file here.

The table below lists the questions asked in the survey and the column under which each response was collected.

Column Name Question Text
Respondent Randomized respondent ID number (not in order of survey response time)
MainBranch Which of the following options best describes you today? Here, by “developer” we mean “someone who writes code.”
Hobbyist Do you code as a hobby?
OpenSourcer How often do you contribute to open source?
OpenSource How do you feel about the quality of open source software (OSS)?
Employment Which of the following best describes your current employment status?
Country In which country do you currently reside?
Student Are you currently enrolled in a formal, degree-granting college or university program?
EdLevel Which of the following best describes the highest level of formal education that you’ve completed?
UndergradMajor What was your main or most important field of study?
EduOther Which of the following types of non-degree education have you used or participated in? Please select all that apply.
OrgSize Approximately how many people are employed by the company or organization you work for?
DevType Which of the following describe you? Please select all that apply.
YearsCode Including any education, how many years have you been coding?
Age1stCode At what age did you write your first line of code or program? (E.g., webpage, Hello World, Scratch project)
YearsCodePro How many years have you coded professionally (as a part of your work)?
CareerSat Overall, how satisfied are you with your career thus far?
JobSat How satisfied are you with your current job? (If you work multiple jobs, answer for the one you spend the most hours on.)
MgrIdiot How confident are you that your manager knows what they’re doing?
MgrMoney Do you believe that you need to be a manager to make more money?
MgrWant Do you want to become a manager yourself in the future?
JobSeek Which of the following best describes your current job-seeking status?
LastHireDate When was the last time that you took a job with a new employer?
LastInt In your most recent successful job interview (resulting in a job offer), you were asked to… (check all that apply)
FizzBuzz Have you ever been asked to solve FizzBuzz in an interview?
JobFactors Imagine that you are deciding between two job offers with the same compensation, benefits, and location. Of the following factors, which 3 are MOST important to you?
ResumeUpdate Think back to the last time you updated your resumé, CV, or an online profile on a job site. What is the PRIMARY reason that you did so?
CurrencySymbol Which currency do you use day-to-day? If your answer is complicated, please pick the one you’re most comfortable estimating in.
CurrencyDesc Which currency do you use day-to-day? If your answer is complicated, please pick the one you’re most comfortable estimating in.
CompTotal What is your current total compensation (salary, bonuses, and perks, before taxes and deductions), in CurrencySymbol? Please enter a whole number in the box below, without any punctuation. If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty.
CompFreq Is that compensation weekly, monthly, or yearly?
ConvertedComp Salary converted to annual USD salaries using the exchange rate on 2019-02-01, assuming 12 working months and 50 working weeks.
WorkWeekHrs On average, how many hours per week do you work?
WorkPlan How structured or planned is your work?
WorkChallenge Of these options, what are your greatest challenges to productivity as a developer? Select up to 3:
WorkRemote How often do you work remotely?
WorkLoc Where would you prefer to work?
ImpSyn For the specific work you do, and the years of experience you have, how do you rate your own level of competence?
CodeRev Do you review code as part of your work?
CodeRevHrs On average, how many hours per week do you spend on code review?
UnitTests Does your company regularly employ unit tests in the development of their products?
PurchaseHow How does your company make decisions about purchasing new technology (cloud, AI, IoT, databases)?
PurchaseWhat What level of influence do you, personally, have over new technology purchases at your organization?
LanguageWorkedWith Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)
LanguageDesireNextYear Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)
DatabaseWorkedWith Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.)
DatabaseDesireNextYear Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.)
PlatformWorkedWith Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.)
PlatformDesireNextYear Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.)
WebFrameWorkedWith Which of the following web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.)
WebFrameDesireNextYear Which of the following web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.)
MiscTechWorkedWith Which of the following other frameworks, libraries, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the technology and want to continue to do so, please check both boxes in that row.)
MiscTechDesireNextYear Which of the following other frameworks, libraries, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the technology and want to continue to do so, please check both boxes in that row.)
DevEnviron Which development environment(s) do you use regularly? Please check all that apply.
OpSys What is the primary operating system in which you work?
Containers How do you use containers (Docker, Open Container Initiative (OCI), etc.)?
BlockchainOrg How is your organization thinking about or implementing blockchain technology?
BlockchainIs Blockchain / cryptocurrency technology is primarily:
BetterLife Do you think people born today will have a better life than their parents?
ITperson Are you the “IT support person” for your family?
OffOn Have you tried turning it off and on again?
SocialMedia What social media site do you use the most?
Extraversion Do you prefer online chat or IRL conversations?
ScreenName What do you call it?
SOVisit1st To the best of your memory, when did you first visit Stack Overflow?
SOVisitFreq How frequently would you say you visit Stack Overflow?
SOVisitTo I visit Stack Overflow to… (check all that apply)
SOFindAnswer On average, how many times a week do you find (and use) an answer on Stack Overflow?
SOTimeSaved Think back to the last time you solved a coding problem using Stack Overflow, as well as the last time you solved a problem using a different resource. Which was faster?
SOHowMuchTime About how much time did you save? If you’re not sure, please use your best estimate.
SOAccount Do you have a Stack Overflow account?
SOPartFreq How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.
SOJobs Have you ever used or visited Stack Overflow Jobs?
EntTeams Have you ever used Stack Overflow for Enterprise or Stack Overflow for Teams?
SOComm Do you consider yourself a member of the Stack Overflow community?
WelcomeChange Compared to last year, how welcome do you feel on Stack Overflow?
SONewContent Would you like to see any of the following on Stack Overflow? Check all that apply.
Age What is your age (in years)? If you prefer not to answer, you may leave this question blank.
Gender Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank.
Trans Do you identify as transgender?
Sexuality Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank.
Ethnicity Which of the following do you identify as? Please check all that apply. If you prefer not to answer, you may leave this question blank.
Dependents Do you have any dependents (e.g., children, elders, or others) that you care for?
SurveyLength How do you feel about the length of the survey this year?
SurveyEase How easy or difficult was this survey to complete?
In [27]:
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m1_survey_data.csv"

df = pd.read_csv(dataset_url)
df.head()
Out[27]:
Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country Student EdLevel UndergradMajor ... WelcomeChange SONewContent Age Gender Trans Sexuality Ethnicity Dependents SurveyLength SurveyEase
0 4 I am a developer by profession No Never The quality of OSS and closed source software ... Employed full-time United States No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 22.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
1 9 I am a developer by profession Yes Once a month or more often The quality of OSS and closed source software ... Employed full-time New Zealand No Some college/university study without earning ... Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN 23.0 Man No Bisexual White or of European descent No Appropriate in length Neither easy nor difficult
2 13 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Somewhat more welcome now than last year Tech articles written by other developers;Cour... 28.0 Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Easy
3 16 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time United Kingdom No Master’s degree (MA, MS, M.Eng., MBA, etc.) NaN ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 26.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Neither easy nor difficult
4 17 I am a developer by profession Yes Less than once a month but more than once per ... The quality of OSS and closed source software ... Employed full-time Australia No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 29.0 Man No Straight / Heterosexual Hispanic or Latino/Latina;Multiracial No Appropriate in length Easy

5 rows × 85 columns

Finding Duplicates¶

In [28]:
# rows that repeat an earlier row in every column
df[df.duplicated()]
Out[28]:
Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country Student EdLevel UndergradMajor ... WelcomeChange SONewContent Age Gender Trans Sexuality Ethnicity Dependents SurveyLength SurveyEase
1168 2339 I am a developer by profession Yes Once a month or more often OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Some college/university study without earning ... Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN 24.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
1169 2342 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time United Kingdom No Some college/university study without earning ... Information systems, information technology, o... ... Somewhat more welcome now than last year Tech meetups or events in your area;Courses on... 24.0 Man No Straight / Heterosexual White or of European descent No Too long Easy
1170 2343 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of LOWER quality than prop... Employed full-time Canada No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Somewhat more welcome now than last year Tech articles written by other developers;Indu... 27.0 Man No Straight / Heterosexual Black or of African descent;White or of Europe... No Appropriate in length Neither easy nor difficult
1171 2344 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time United States No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 24.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
1172 2347 I am a developer by profession Yes Never OSS is, on average, of HIGHER quality than pro... Employed full-time United Kingdom No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN NaN Woman No Straight / Heterosexual Biracial No Too long Easy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2297 4674 I am not primarily a developer, but I write co... Yes Less than once per year The quality of OSS and closed source software ... Employed full-time Bangladesh No Bachelor’s degree (BA, BS, B.Eng., etc.) NaN ... Somewhat less welcome now than last year Tech articles written by other developers;Indu... 31.0 Man No Bisexual;Gay or Lesbian;Straight / Heterosexual Black or of African descent;Hispanic or Latino... Yes Too long Neither easy nor difficult
2298 4675 I am a developer by profession Yes Never OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Bachelor’s degree (BA, BS, B.Eng., etc.) Information systems, information technology, o... ... Just as welcome now as I felt last year Tech meetups or events in your area 27.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
2299 4676 I am a developer by profession Yes Never OSS is, on average, of HIGHER quality than pro... Employed full-time Finland No Master’s degree (MA, MS, M.Eng., MBA, etc.) Another engineering discipline (ex. civil, ele... ... Somewhat less welcome now than last year NaN 36.0 Man No Straight / Heterosexual White or of European descent Yes Too long Easy
2300 4677 I am a developer by profession Yes Once a month or more often OSS is, on average, of HIGHER quality than pro... Employed full-time United Kingdom No Bachelor’s degree (BA, BS, B.Eng., etc.) A natural science (ex. biology, chemistry, phy... ... Just as welcome now as I felt last year NaN 40.0 Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Easy
2301 4679 I am a developer by profession Yes Less than once a month but more than once per ... The quality of OSS and closed source software ... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN 27.0 Man No NaN White or of European descent No Appropriate in length Easy

154 rows × 85 columns

Removing Duplicates¶

In [29]:
# rows left after excluding duplicates (compared on all columns)
df[~df.duplicated()]
Out[29]:
Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country Student EdLevel UndergradMajor ... WelcomeChange SONewContent Age Gender Trans Sexuality Ethnicity Dependents SurveyLength SurveyEase
0 4 I am a developer by profession No Never The quality of OSS and closed source software ... Employed full-time United States No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 22.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
1 9 I am a developer by profession Yes Once a month or more often The quality of OSS and closed source software ... Employed full-time New Zealand No Some college/university study without earning ... Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN 23.0 Man No Bisexual White or of European descent No Appropriate in length Neither easy nor difficult
2 13 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Somewhat more welcome now than last year Tech articles written by other developers;Cour... 28.0 Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Easy
3 16 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time United Kingdom No Master’s degree (MA, MS, M.Eng., MBA, etc.) NaN ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 26.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Neither easy nor difficult
4 17 I am a developer by profession Yes Less than once a month but more than once per ... The quality of OSS and closed source software ... Employed full-time Australia No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 29.0 Man No Straight / Heterosexual Hispanic or Latino/Latina;Multiracial No Appropriate in length Easy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11547 25136 I am a developer by profession Yes Never OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Cour... 36.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Difficult
11548 25137 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time Poland No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... A lot more welcome now than last year Tech articles written by other developers;Tech... 25.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Neither easy nor difficult
11549 25138 I am a developer by profession Yes Less than once per year The quality of OSS and closed source software ... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... A lot more welcome now than last year Tech articles written by other developers;Indu... 34.0 Man No Straight / Heterosexual White or of European descent Yes Too long Easy
11550 25141 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of LOWER quality than prop... Employed full-time Switzerland No Secondary school (e.g. American high school, G... NaN ... Somewhat less welcome now than last year NaN 25.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
11551 25142 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of HIGHER quality than pro... Employed full-time United Kingdom No Other doctoral degree (Ph.D, Ed.D., etc.) A natural science (ex. biology, chemistry, phy... ... Just as welcome now as I felt last year Tech articles written by other developers;Tech... 30.0 Man No Bisexual White or of European descent No Appropriate in length Easy

11398 rows × 85 columns

In [30]:
df = df.drop_duplicates()
df
Out[30]:
Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country Student EdLevel UndergradMajor ... WelcomeChange SONewContent Age Gender Trans Sexuality Ethnicity Dependents SurveyLength SurveyEase
0 4 I am a developer by profession No Never The quality of OSS and closed source software ... Employed full-time United States No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 22.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
1 9 I am a developer by profession Yes Once a month or more often The quality of OSS and closed source software ... Employed full-time New Zealand No Some college/university study without earning ... Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year NaN 23.0 Man No Bisexual White or of European descent No Appropriate in length Neither easy nor difficult
2 13 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Somewhat more welcome now than last year Tech articles written by other developers;Cour... 28.0 Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Easy
3 16 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time United Kingdom No Master’s degree (MA, MS, M.Eng., MBA, etc.) NaN ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 26.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Neither easy nor difficult
4 17 I am a developer by profession Yes Less than once a month but more than once per ... The quality of OSS and closed source software ... Employed full-time Australia No Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Indu... 29.0 Man No Straight / Heterosexual Hispanic or Latino/Latina;Multiracial No Appropriate in length Easy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11547 25136 I am a developer by profession Yes Never OSS is, on average, of HIGHER quality than pro... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... Just as welcome now as I felt last year Tech articles written by other developers;Cour... 36.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Difficult
11548 25137 I am a developer by profession Yes Never The quality of OSS and closed source software ... Employed full-time Poland No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... A lot more welcome now than last year Tech articles written by other developers;Tech... 25.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Neither easy nor difficult
11549 25138 I am a developer by profession Yes Less than once per year The quality of OSS and closed source software ... Employed full-time United States No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or sof... ... A lot more welcome now than last year Tech articles written by other developers;Indu... 34.0 Man No Straight / Heterosexual White or of European descent Yes Too long Easy
11550 25141 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of LOWER quality than prop... Employed full-time Switzerland No Secondary school (e.g. American high school, G... NaN ... Somewhat less welcome now than last year NaN 25.0 Man No Straight / Heterosexual White or of European descent No Appropriate in length Easy
11551 25142 I am a developer by profession Yes Less than once a month but more than once per ... OSS is, on average, of HIGHER quality than pro... Employed full-time United Kingdom No Other doctoral degree (Ph.D, Ed.D., etc.) A natural science (ex. biology, chemistry, phy... ... Just as welcome now as I felt last year Tech articles written by other developers;Tech... 30.0 Man No Bisexual White or of European descent No Appropriate in length Easy

11398 rows × 85 columns

Finding Missing Values¶

In [31]:
# Evaluating for Missing Data using either .isnull() or .notnull()
df['WorkLoc'].isnull().value_counts()
Out[31]:
False    11366
True        32
Name: WorkLoc, dtype: int64
In [32]:
df['ConvertedComp'].isnull().value_counts()
Out[32]:
False    10582
True       816
Name: ConvertedComp, dtype: int64
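
The two checks above look at one column at a time. As a minimal sketch, the same idea can be applied to every column at once; the small `toy` frame below is invented for illustration and is not the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration
toy = pd.DataFrame({
    "WorkLoc": ["Office", None, "Home", "Office"],
    "ConvertedComp": [61000.0, np.nan, np.nan, 90000.0],
    "Age": [22.0, 23.0, 28.0, 26.0],
})

# Number of missing entries per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of missing entries per column, as a fraction of all rows
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Sorting the counts puts the columns that need the most attention at the top, which helps when deciding where imputation is worthwhile.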

Imputing missing values¶

When to impute with mode

In [33]:
# calculate the most common value in the WorkLoc column
mcv = df['WorkLoc'].value_counts().idxmax()

# replace the missing 'WorkLoc' values with the most frequent value
# (assigning the result back avoids relying on an in-place edit of a column selection)
df['WorkLoc'] = df['WorkLoc'].fillna(mcv)

# Verify if imputing was successful
df['WorkLoc'].isnull().value_counts()
Out[33]:
False    11398
Name: WorkLoc, dtype: int64

When to impute with median

In [34]:
# calculate median compensation
medcomp = df['ConvertedComp'].median()

# replace the missing 'ConvertedComp' values with the median
# (assigning the result back avoids relying on an in-place edit of a column selection)
df['ConvertedComp'] = df['ConvertedComp'].fillna(medcomp)

# Verify if imputing was successful
df['ConvertedComp'].isnull().value_counts()
Out[34]:
False    11398
Name: ConvertedComp, dtype: int64
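
The two cells above follow a common rule of thumb: the mode for categorical columns, and the median for numeric columns, because the median is less sensitive to extreme values than the mean. A minimal sketch of that rule follows; the helper name `impute_column` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_column(series: pd.Series) -> pd.Series:
    """Fill gaps with the median for numeric data, otherwise with the mode."""
    if pd.api.types.is_numeric_dtype(series):
        fill_value = series.median()        # robust to extreme values
    else:
        fill_value = series.mode().iloc[0]  # most frequent category
    return series.fillna(fill_value)

# Invented values: one very large salary shows why the median is preferred here
toy = pd.DataFrame({
    "WorkLoc": ["Office", "Office", None, "Home"],
    "ConvertedComp": [50000.0, np.nan, 70000.0, 2000000.0],
})

# Apply the helper column by column
imputed = toy.apply(impute_column)
print(imputed)
```

In this toy frame the missing salary is filled with 70000.0, the median of the three known values, rather than being pulled upward by the single 2,000,000 entry.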

Normalizing Data¶

There are two columns in the dataset that describe compensation. One is "CompFreq", which shows how often a developer is paid (Yearly, Monthly, Weekly). The other is "CompTotal", which gives how much the developer is paid per year, month, or week, depending on their "CompFreq". Because the amounts cover different pay periods, it is difficult to compare the developers' total compensation directly.

Create a new column called 'NormalizedAnnualCompensation' which contains the 'Annual Compensation' irrespective of the 'CompFreq'.

Once this column is ready, it makes comparison of salaries easy.

In [35]:
df['CompFreq'].value_counts()
Out[35]:
Yearly     6073
Monthly    4788
Weekly      331
Name: CompFreq, dtype: int64
In [36]:
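def calculate_value(freq):
    # Number of pay periods per year for each payment frequency
    if freq == 'Yearly':
        return 1
    elif freq == 'Monthly':
        return 12
    elif freq == 'Weekly':
        return 52
    else:
        return None  # missing or unrecognized frequency

# Apply the function to create the 'CompFreqVal' multiplier column
df['CompFreqVal'] = df['CompFreq'].apply(calculate_value)

# Annual compensation = amount per pay period x pay periods per year
df['NormalizedAnnualCompensation'] = df['CompTotal'] * df['CompFreqVal']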
In [37]:
df['NormalizedAnnualCompensation'].head()
Out[37]:
0     61000.0
1    138000.0
2     90000.0
3    348000.0
4     90000.0
Name: NormalizedAnnualCompensation, dtype: float64
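
The same normalization can be written as a dictionary lookup instead of a function with branches. This is only a sketch on invented values; the name `PERIODS_PER_YEAR` is hypothetical, and the 52-week multiplier simply mirrors the cell above.

```python
import pandas as pd

# Pay periods per year for each frequency (52 weeks assumed, as in the cell above)
PERIODS_PER_YEAR = {"Yearly": 1, "Monthly": 12, "Weekly": 52}

# Invented values for illustration; the last row has no pay frequency
toy = pd.DataFrame({
    "CompFreq": ["Yearly", "Monthly", "Weekly", None],
    "CompTotal": [61000.0, 11500.0, 1000.0, 5000.0],
})

# map() looks up each frequency; values missing from the dictionary become NaN
toy["NormalizedAnnualCompensation"] = (
    toy["CompTotal"] * toy["CompFreq"].map(PERIODS_PER_YEAR)
)
print(toy)
```

Rows without a recognized frequency end up as NaN rather than being silently treated as yearly, which keeps them visible for later cleaning.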

Module 3: Exploratory Data Analysis

In [38]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")

Distribution¶

The column ConvertedComp contains Salary converted to annual USD salaries using the exchange rate on 2019-02-01. This assumes 12 working months and 50 working weeks.

Plot the distribution curve for the column ConvertedComp

In [39]:
# sns.distplot is deprecated in recent seaborn releases; displot with kind='kde' draws the density curve
sns.displot(df['ConvertedComp'], kind='kde')
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x28558379350>

Plot the histogram for the column ConvertedComp

In [40]:
count, bin_edges = np.histogram(df['ConvertedComp'].dropna())

df['ConvertedComp'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)

plt.title('Histogram of Salary converted to annual USD salaries') # add a title to the histogram
plt.ylabel('Number of respondents') # y-axis of a histogram shows the count per bin
plt.xlabel('Salary in USD') # x-axis shows the salary bins

plt.show()

Give the five-number summary for the column Age.

In [41]:
df['Age'].describe()
Out[41]:
count    11111.000000
mean        30.778895
std          7.393686
min         16.000000
25%         25.000000
50%         29.000000
75%         35.000000
max         99.000000
Name: Age, dtype: float64

Plot a histogram of the column Age.

In [42]:
count, bin_edges = np.histogram(df['Age'].dropna())

df['Age'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)

plt.title('Histogram of Age') # add a title to the histogram
plt.ylabel('Count') # y-axis of a histogram shows the count per bin
plt.xlabel('Age') # x-axis shows the age bins

plt.show()

Outliers¶

Find out if outliers exist in the column ConvertedComp using a box plot

In [43]:
df['ConvertedComp'].plot(kind='box', figsize=(15,7))

plt.title('Box plot of Salary in USD')
plt.ylabel('Salary in USD')

plt.show()

Based on the boxplot of ‘Age’, how many outliers do you see below Q1?
Ans: Zero

Find the interquartile range (IQR) for the column ConvertedComp.

In [44]:
Q1 = df['ConvertedComp'].quantile(0.25)
Q3 = df['ConvertedComp'].quantile(0.75)
IQR = Q3 - Q1
IQR
Out[44]:
73132.0

Find out the upper and lower bounds.

In [45]:
print("Upper Bound: {}".format(df['ConvertedComp'].max()))
print("Lower Bound: {}".format(df['ConvertedComp'].min()))
Upper Bound: 2000000.0
Lower Bound: 0.0
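
The cell above reports the column's maximum and minimum. If the bounds are instead meant as the outlier fences used in the next step (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), a minimal sketch on invented numbers could look like this. The helper `iqr_bounds` and the toy series are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented salaries with one very low and one very high value
toy_comp = pd.Series([0.0, 40000.0, 50000.0, 60000.0, 70000.0, 2000000.0])

lower, upper = iqr_bounds(toy_comp)

# Values outside the fences are the candidates for outliers
n_outliers = int(((toy_comp < lower) | (toy_comp > upper)).sum())

print("Lower fence:", lower)
print("Upper fence:", upper)
print("Outliers:", n_outliers)
```

Values outside these fences are the ones the box plot draws as individual points beyond the whiskers.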

Identify how many outliers there are in the ConvertedComp column.

In [46]:
((df['ConvertedComp'] < (Q1 - 1.5 * IQR)) | (df['ConvertedComp'] > (Q3 + 1.5 * IQR))).sum()
Out[46]:
879

Create a new dataframe by removing the outliers from the ConvertedComp column.

In [47]:
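mask = (df['ConvertedComp'] < (Q1 - 1.5 * IQR)) | (df['ConvertedComp'] > (Q3 + 1.5 * IQR))
# Keep only the non-outlier rows as a new dataframe instead of blanking whole rows in place;
# rebinding df means the later cells work on the filtered data
df = df[~mask].copy()
df['ConvertedComp'].mean()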
Out[47]:
59883.20838915799

Correlation¶

Find the correlation between Age and all other numerical columns.

In [48]:
df.corr(numeric_only=True)
Out[48]:
               Respondent  CompTotal  ConvertedComp  WorkWeekHrs  CodeRevHrs        Age
Respondent       1.000000  -0.019364       0.010878    -0.015275    0.002980   0.003950
CompTotal       -0.019364   1.000000      -0.063561     0.004975    0.017536   0.006371
ConvertedComp    0.010878  -0.063561       1.000000     0.034351   -0.088934   0.401821
WorkWeekHrs     -0.015275   0.004975       0.034351     1.000000    0.031963   0.037452
CodeRevHrs       0.002980   0.017536      -0.088934     0.031963    1.000000  -0.017961
Age              0.003950   0.006371       0.401821     0.037452   -0.017961   1.000000
  • Which column has a negative correlation with "Age"?
  • Which column has the highest correlation with "Age"?
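
Both questions can be answered by pulling the "Age" column out of the correlation matrix and sorting it. The sketch below runs on a small invented frame, not the survey data; on the real dataframe the call would be `df.corr(numeric_only=True)` so that text columns are skipped.

```python
import pandas as pd

# Invented values: pay rises with age, code review hours fall with age
toy = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "ConvertedComp": [40000, 52000, 61000, 75000, 90000],
    "CodeRevHrs": [6, 5, 5, 3, 2],
})

# All toy columns are numeric, so corr() needs no extra arguments here
age_corr = toy.corr()["Age"].drop("Age").sort_values()

# Columns whose correlation with Age is below zero
negative = age_corr[age_corr < 0].index.tolist()

# Column with the highest correlation with Age
strongest = age_corr.idxmax()

print(age_corr)
print("Negative:", negative)
print("Highest:", strongest)
```

Dropping "Age" from its own column removes the trivial self-correlation of 1.0, so `idxmax()` returns the most correlated other column.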

Module 4: Data Visualization

Visualizing the distribution of data

  • Histograms
  • Box plots

Visualizing relationship

  • Scatter plots
  • Bubble plots

Visualizing composition of data

  • Pie charts
  • Stacked charts

Visualizing comparison of data

  • Line charts
  • Bar charts

Module 5: Presentation of Findings

After data has been collected, cleaned and organized, the work of interpretation begins. You are now able to obtain a complete view of the data and hopefully answer the questions that were formed before starting the analysis. Now you typically compose a findings report that explains what was learned. Depending on the audience, the report can be

  • a paper style report
  • a slideshow presentation
  • maybe both

Outline¶

  • Cover page
    • Title of presentation
    • Name(s) of author(s)
    • Author(s)' affiliations (optional)
    • Author(s)' contact information (optional)
    • Institutional publisher name (optional)
    • Date (of publication)
  • Table of contents
    • Sections of the report
      • Subsections
  • Executive summary
    • Briefly explain the details of project
    • Considered a standalone document
    • Taken from the main points of the report. While it is acceptable to repeat information, no new information is presented
  • Introduction
    • Explain the nature of the analysis
    • State the problem
    • State the questions for analysis
  • Literature Review (good to have)
    • Review available relevant research on the subject matter
    • Length of this section depends on how contested the subject matter is

    In instances where the vast majority of researchers have concluded in one direction, the literature review could be brief with citations for only the most influential authors on the subject.
  • Methodology
    • Explain the data sources that were used in the analysis
    • Outline the plan for the collected data

    E.g., was a clustering or regression method used to analyze the data?
    • If you have collected new data, explain the data collection exercise in some detail.
    • Refer to the literature review to bolster your choice for variables, data, and methods and how they will help you answer your research questions.
  • Results
    • Present your empirical findings
    • Contains the charts and graphs that would substantiate the results and call attention to more complex/crucial findings. Starting with descriptive statistics and illustrative graphics, move toward formally testing your hypothesis (if needed)
    • Go into the detail of the data collection
      • how it was organized
      • how it was analyzed
    • Give a detailed interpretation/explanation of data to the audience and convey how it relates to the problem that was stated in the introduction

    Note: many reports in the business sector present results in a more palatable fashion by holding back the statistical details
  • Discussion
    • Rely on the power of narrative to enable numbers to communicate your thesis to your readers.
    • Refer the reader to the research question and the knowledge gaps you identified earlier; highlight how your findings provide the missing piece to the puzzle
    • Engage the audience with a discussion of your implications that were drawn from your research
    E.g., say you are conducting research on the top programming languages for college graduates. Would you find that they need to learn multiple languages to remain competitive in the job market, or would one language remain supreme?

    Of course, not all analytics return a smoking gun. At times, more frequently than I would like to acknowledge, the results provide only a partial answer to the question and that, too, with a long list of caveats.
  • Conclusion
    • Generalize your specific findings and take on a rather marketing approach to promote your findings so that the reader does not remain stuck in the caveats that you have voluntarily outlined earlier
    • You might also identify future possible developments in research and applications that could result from your research.
    • Reiterate the problem stated in the introduction
    • Give an overall summary of the findings
    • State the outcome of the analysis and whether any other steps will be taken in the future
  • Appendix
    • Contains information that did not fit in the main body of the report but that you still deemed important enough to include
      • location where the raw data was collected
      • resources/acknowledgements/references

Have You Done Your Job as a Writer?

As a data scientist, you are expected to do thorough analysis with the appropriate data, deploying the appropriate tools. As a writer, you are responsible for communicating your findings to the readers. Transport Policy, a leading research publication in transportation planning, offers a checklist for authors interested in publishing with the journal. The checklist is a series of questions authors are expected to consider before submitting their manuscripts to the journal. I believe the checklist is useful for budding data scientists and, therefore, I have reproduced it verbatim for their benefit.

  • Have you told readers, at the outset, what they might gain by reading your paper?

  • Have you made the aim of your work clear?

  • Have you explained the significance of your contribution?

  • Have you set your work in the appropriate context by giving sufficient background (including a complete set of relevant references) to your work?

  • Have you addressed the question of practicality and usefulness?

  • Have you identified future developments that might result from your work?

  • Have you structured your paper in a clear and logical fashion?

Best practices for presenting your findings¶

  • Make sure charts and graphs are not too small and are clearly labelled
  • Use the data only as supporting evidence
  • Share only one point from each chart
    • Begin by forming the key messages that need to be conveyed to the audience
    • Build the story around these messages
    • After building the outline, go back and insert the data that supports these findings
  • Eliminate data that does not support the key message

Some items that seem interesting to the analyst may not be relevant to the project. Trying to explain every little detail to your audience, and failing to recognize irrelevant data, could damage the key message.
