from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
import requests
import os
from PIL import Image
from IPython.display import IFrame
import pandas as pd
import json
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup # this module helps with web scraping
import requests # this module helps us to download a web page
url='https://www.ibm.com/'
r=requests.get(url)
# the status of the request
r.status_code
200
# view the request headers
r.request.headers
{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': '_abck=000D7EA7D7226CDF1765622CD2AC1930~-1~YAAQCW4sF0LEcTeKAQAAmBikOApQl6zj/G3FD9K7kO+BqT1Dz+/NwDzEfFNXlwN6l7IHECnTO2dheHwLEqAsaQJPl8iK45pMqcMRsQALaLSA6+tDgYK4C6tTh5fssEWEIA9NiKPYmmgKQSI6/C7usiXcEO+foYZfmsan+d+DT2VqAUYwb2G2xDP3tQS/74oBzyiFWQWGjpvkADTjBI0umnqo7Ub79Omm+ui2EMUJ6SjT8FC+rsNkfiyGeDDX0HNw88EcBzPSQMsJOhf3G1osSQtnjdTxFbL6XiSMle4HTDsTJcfhMvdVRjXYq5fKW1O2CFFQGF4OnjNo3M12Ay2zAQTN5hEklzePkwQTyiwnNJ4/NIyXCmY=~-1~-1~-1; bm_sz=A4640AAB85E461EC3B0C237E9F551D69~YAAQCW4sF0PEcTeKAQAAmBikOBSDMfGrsdxKsBQv9fkHmJArxYMH3Vfh8XaN6JQpS2G02pUGPqsYaqkOXxq1Tx2mo/LRGmP5meAPIoarL0UvSQE6MJIY/4BdRpMqBpokiFLfVkSJ05s+eexffexHpU7YF2AEfaoe02JIqoG7705Qk5KumlXmsTzhYobJv+MEilTZYq58oOZJpGMBwHcg3pJoXk9WG3+vPirDXb/B1nDqywceXcB+PsKxr95eBjX2lncBFSPYDT6aIbIHykH3SENNr5JEbjVmg0SfV2G3w28=~3354681~4473655'}
# view the HTTP response header
header=r.headers
print(r.headers)
{'Accept-Ranges': 'bytes', 'Content-Type': 'text/html', 'ETag': '"81cf3b5dda132f3aeeaf7cf74c998296:1692916622.064576"', 'Last-Modified': 'Thu, 24 Aug 2023 18:38:39 GMT', 'Server': 'AkamaiNetStorage', 'Cache-Control': 'max-age=303', 'Expires': 'Sun, 27 Aug 2023 20:21:35 GMT', 'X-Akamai-Transformed': '9 20630 0 pmb=mTOE,2', 'Content-Encoding': 'gzip', 'Date': 'Sun, 27 Aug 2023 20:16:32 GMT', 'Content-Length': '20823', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'x-content-type-options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Strict-Transport-Security': 'max-age=31536000'}
# obtain the date the response was generated (header lookups are case-insensitive, so 'date' matches 'Date')
header['date']
'Sun, 27 Aug 2023 20:16:32 GMT'
# Content-Type indicates the type of data
header['Content-Type']
'text/html'
r.text[0:100]
'<!DOCTYPE html><html lang="en-US"><head><meta name="viewport" content="width=device-width"/><meta ch'
# Use single quotation marks for defining string
url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/module%201/images/IDSNlogo.png'
r=requests.get(url)
r.headers['Content-Type']
'image/png'
path=os.path.join(os.getcwd(),'image.png')
with open(path,'wb') as f:
    f.write(r.content)
# view the image:
Image.open(path)
# You can use the GET method to modify the results of your query
url_get='http://httpbin.org/get'
payload={"name":"Joseph","ID":"123"}
r=requests.get(url_get,params=payload)
r.url
'http://httpbin.org/get?name=Joseph&ID=123'
r.headers['Content-Type']
'application/json'
r.json()
{'args': {'ID': '123', 'name': 'Joseph'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.28.1',
  'X-Amzn-Trace-Id': 'Root=1-64ebaf23-22536565710040f83586776f'},
 'origin': '71.105.236.170',
 'url': 'http://httpbin.org/get?name=Joseph&ID=123'}
Like a GET request, a POST request can send data to a server, but a POST request carries the data in the request body rather than in the URL.
url_post='http://httpbin.org/post'
r_post=requests.post(url_post,data=payload)
r_post.json()['form']
{'ID': '123', 'name': 'Joseph'}
Comparing the URLs on the response objects of the GET and POST requests, we see that the POST request URL contains no name-value pairs.
print("POST request URL:",r_post.url )
print("GET request URL:",r.url)
POST request URL: http://httpbin.org/post
GET request URL: http://httpbin.org/get?name=Joseph&ID=123
Comparing the POST and GET request bodies, we see only the POST request has a body:
print("POST request body:",r_post.request.body)
print("GET request body:",r.request.body)
POST request body: name=Joseph&ID=123
GET request body: None
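The difference between the two requests comes down to where the encoded payload travels. As a minimal offline sketch (using the standard library's `urllib.parse` in place of `requests`, purely for illustration), the same `payload` is form-encoded once and then either appended to the URL or placed in the body:

```python
from urllib.parse import urlencode

payload = {"name": "Joseph", "ID": "123"}

# Form-encode the payload into name=value pairs joined by '&',
# the same format seen in the GET URL and the POST body above.
encoded = urlencode(payload)

# GET: the encoded pairs travel in the URL's query string.
get_url = "http://httpbin.org/get" + "?" + encoded

# POST: the URL stays bare and the encoded pairs travel in the body.
post_url = "http://httpbin.org/post"
post_body = encoded

print("GET request URL:", get_url)
print("POST request URL:", post_url)
print("POST request body:", post_body)
```

No request is sent here; the sketch only reproduces the encoding step.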
The API at http://api.open-notify.org/astros.json returns information, in JSON format, on the astronauts currently in space, including those aboard the ISS.
You can read more about this API at http://open-notify.org/Open-Notify-API/People-In-Space/
api_url = "http://api.open-notify.org/astros.json"
response = requests.get(api_url)
if response.ok: # True when the status code is below 400
    data = response.json() # parse the JSON body and store it in a variable called data
    print(data) # print the data to check the output or for debugging
{'number': 10, 'people': [{'name': 'Sergey Prokopyev', 'craft': 'ISS'}, {'name': 'Dmitry Petelin', 'craft': 'ISS'}, {'name': 'Frank Rubio', 'craft': 'ISS'}, {'name': 'Stephen Bowen', 'craft': 'ISS'}, {'name': 'Warren Hoburg', 'craft': 'ISS'}, {'name': 'Sultan Alneyadi', 'craft': 'ISS'}, {'name': 'Andrey Fedyaev', 'craft': 'ISS'}, {'name': 'Jing Haiping', 'craft': 'Tiangong'}, {'name': 'Gui Haichow', 'craft': 'Tiangong'}, {'name': 'Zhu Yangzhu', 'craft': 'Tiangong'}], 'message': 'success'}
The number of astronauts currently in space (the response above includes the crews of both the ISS and Tiangong)
print(data.get('number'))
10
The name of each astronaut
astronauts = data.get('people')
print("There are {} astronauts on ISS".format(len(astronauts)))
print("And their names are :")
for astronaut in astronauts:
    print(astronaut.get('name'))
There are 10 astronauts on ISS
And their names are :
Sergey Prokopyev
Dmitry Petelin
Frank Rubio
Stephen Bowen
Warren Hoburg
Sultan Alneyadi
Andrey Fedyaev
Jing Haiping
Gui Haichow
Zhu Yangzhu
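The `craft` field in each record distinguishes the stations, so the count printed above covers more than the ISS. A small offline sketch, assuming the response text is the one shown in the output above (copied here as a literal so that no network call is needed), tallies the people per craft:

```python
import json
from collections import Counter

# Response text copied from the output shown above (no network call is made).
raw = """
{"number": 10, "message": "success", "people": [
  {"name": "Sergey Prokopyev", "craft": "ISS"},
  {"name": "Dmitry Petelin", "craft": "ISS"},
  {"name": "Frank Rubio", "craft": "ISS"},
  {"name": "Stephen Bowen", "craft": "ISS"},
  {"name": "Warren Hoburg", "craft": "ISS"},
  {"name": "Sultan Alneyadi", "craft": "ISS"},
  {"name": "Andrey Fedyaev", "craft": "ISS"},
  {"name": "Jing Haiping", "craft": "Tiangong"},
  {"name": "Gui Haichow", "craft": "Tiangong"},
  {"name": "Zhu Yangzhu", "craft": "Tiangong"}
]}
"""

data = json.loads(raw)

# Count how many people are aboard each craft.
per_craft = Counter(person["craft"] for person in data["people"])

for craft, count in per_craft.items():
    print("{}: {}".format(craft, count))
```

For this snapshot the tally is 7 on the ISS and 3 on Tiangong, which adds up to the `number` field.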
Job Title
Job Experience Required
Key Skills
Role Category
Location
Functional Area
Industry
Role
How many job postings exist for the Python programming language?
api_url="http://127.0.0.1:5000/data"
def get_number_of_jobs_T(technology):
    payload = {"Key Skills":technology}
    response = requests.get(api_url,params=payload)
    response.raise_for_status() # raise on an HTTP error instead of returning an undefined count
    data = response.json()
    number_of_jobs = len(data)
    return technology,number_of_jobs
get_number_of_jobs_T("Python")
('Python', 1173)
How many job postings are there for the following locations:
# How many job postings exist for each location?
def get_number_of_jobs_L(location):
    payload = {"Location":location}
    response = requests.get(api_url,params=payload)
    response.raise_for_status() # raise on an HTTP error instead of returning an undefined count
    data = response.json()
    number_of_jobs = len(data)
    return location,number_of_jobs
locations = ['Los Angeles', 'New York', 'San Francisco', 'Washington DC', 'Seattle']
for location in locations:
    posting_count = get_number_of_jobs_L(location)
    print(posting_count)
('Los Angeles', 640)
('New York', 3226)
('San Francisco', 435)
('Washington DC', 5316)
('Seattle', 3375)
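The two functions above differ only in the field they filter on, so they could be folded into one. The following is a sketch under assumptions, not the lab's reference solution: `count_postings` and its `get` parameter are hypothetical names, and `get` exists only so the function can be exercised without the local server running.

```python
def count_postings(field, value, api_url="http://127.0.0.1:5000/data", get=None):
    """Return (value, number of postings whose `field` matches `value`)."""
    if get is None:
        # Imported lazily; by default the real requests.get is used.
        import requests
        get = requests.get
    # A timeout keeps the call from hanging if the server is unreachable.
    response = get(api_url, params={field: value}, timeout=10)
    # Raise on 4xx/5xx responses rather than counting an error payload.
    response.raise_for_status()
    return value, len(response.json())
```

With the local server running, `count_postings("Key Skills", "Python")` would play the role of `get_number_of_jobs_T("Python")`, and `count_postings("Location", "Seattle")` that of `get_number_of_jobs_L("Seattle")`.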
url = "http://www.ibm.com"
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text
soup = BeautifulSoup(data,"html5lib") # create a soup object using the variable 'data' and the class BeautifulSoup
Scrape all links
for link in soup.find_all('a'): # in html anchor/link is represented by the tag <a>
    print(link.get('href'))
https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ceo-generative-ai/application-modernization/
https://www.ibm.com/community/ibm-techxchange-conference
https://www.ibm.com/products/watsonx-ai
https://www.ibm.com/products/watsonx-data
https://www.ibm.com/products/spss-statistics/pricing
https://www.ibm.com/sports/usopen
https://www.ibm.com/cloud?lnk=flatitem
https://www.ibm.com/products
https://www.ibm.com/consulting
https://www.ibm.com/about
https://www.ibm.com/
Scrape all images
for link in soup.find_all('img'): # in html image is represented by the tag <img>
    print(link.get('src'))
https://1.dam.s81c.com/p/0c627169442d5243/ibm_watsonx_data_closeup_still_4k.jpg.global.sr_1x1.jpg
https://1.dam.s81c.com/p/0c3ce2dfcccd1f24/watsonx-data-square.jpg
https://1.dam.s81c.com/p/0c3ce2dfcccd1f25/watsonx-ai-square.jpg
https://1.dam.s81c.com/p/0b5258b292cc8c3c/ibm-SPSS-home-card.png.global.xs_1x1.png
https://1.dam.s81c.com/p/0c9c5faa18c5c7c0/0c6278d221ada9b-1230810_ibm_us_open_2023_leadspace.jpg
https://1.dam.s81c.com/p/0aac9cf57bcbf324/dotcom-1-overview.jpg
Scrape data from html tables
# The URL below points to an HTML table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")
# find an HTML table in the web page
table = soup.find('table') # in html a table is represented by the tag <table>
# get all rows from the table
for row in table.find_all('tr'): # in html a table row is represented by the tag <tr>
    # get all cells in each row
    cols = row.find_all('td') # in html a table cell is represented by the tag <td>
    color_name = cols[2].getText() # store the value in column 3 as color_name
    color_code = cols[3].getText() # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))
Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
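The loop above indexes `cols[2]` and `cols[3]` directly, which would raise an `IndexError` on any row with fewer than four cells. A hedged variant, shown on a hypothetical miniature of the table (the inline HTML, including the first two column labels and the short last row, is made up for illustration), skips short rows and collects the pairs instead of printing them. It uses Python's built-in `html.parser` backend rather than `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the color table, for illustration only.
html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code#RRGGBB</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
  <tr><td colspan="4">a short row with a single cell</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

colors = []
for row in soup.find("table").find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:
        # Skip rows that do not have the third and fourth cells.
        continue
    colors.append((cols[2].get_text(strip=True), cols[3].get_text(strip=True)))

# The first collected pair is the header row; the rest are data rows.
header, *records = colors
print(header)
print(records)
```

Keeping the pairs in a list makes it straightforward to hand them to `pandas` later, for example with `pd.DataFrame(records, columns=header)`.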
Stack Overflow, a popular website for developers, conducted an online survey of software professionals across the world. The survey data was later open sourced by Stack Overflow. The actual data set has around 90,000 responses.
The dataset you are going to use in this assignment comes from the following source: https://stackoverflow.blog/2019/04/09/the-2019-stack-overflow-developer-survey-results-are-in/ under the Open Database License (ODbL).
You will be given a subset of the original data set in this capstone project. You will explore, analyze, and visualize this dataset and present your analysis.
Note: This randomised subset contains around 1/10th of the original data set. Any conclusions you draw after analyzing this subset may not reflect the real-world scenario.
The dataset is available as a .csv file.
The table below lists the questions asked in the survey and the column under which each response was collected.
| Column Name | Question Text |
|---|---|
| Respondent | Randomized respondent ID number (not in order of survey response time) |
| MainBranch | Which of the following options best describes you today? Here, by “developer” we mean “someone who writes code.” |
| Hobbyist | Do you code as a hobby? |
| OpenSourcer | How often do you contribute to open source? |
| OpenSource | How do you feel about the quality of open source software (OSS)? |
| Employment | Which of the following best describes your current employment status? |
| Country | In which country do you currently reside? |
| Student | Are you currently enrolled in a formal, degree-granting college or university program? |
| EdLevel | Which of the following best describes the highest level of formal education that you’ve completed? |
| UndergradMajor | What was your main or most important field of study? |
| EduOther | Which of the following types of non-degree education have you used or participated in? Please select all that apply. |
| OrgSize | Approximately how many people are employed by the company or organization you work for? |
| DevType | Which of the following describe you? Please select all that apply. |
| YearsCode | Including any education, how many years have you been coding? |
| Age1stCode | At what age did you write your first line of code or program? (E.g., webpage, Hello World, Scratch project) |
| YearsCodePro | How many years have you coded professionally (as a part of your work)? |
| CareerSat | Overall, how satisfied are you with your career thus far? |
| JobSat | How satisfied are you with your current job? (If you work multiple jobs, answer for the one you spend the most hours on.) |
| MgrIdiot | How confident are you that your manager knows what they’re doing? |
| MgrMoney | Do you believe that you need to be a manager to make more money? |
| MgrWant | Do you want to become a manager yourself in the future? |
| JobSeek | Which of the following best describes your current job-seeking status? |
| LastHireDate | When was the last time that you took a job with a new employer? |
| LastInt | In your most recent successful job interview (resulting in a job offer), you were asked to… (check all that apply) |
| FizzBuzz | Have you ever been asked to solve FizzBuzz in an interview? |
| JobFactors | Imagine that you are deciding between two job offers with the same compensation, benefits, and location. Of the following factors, which 3 are MOST important to you? |
| ResumeUpdate | Think back to the last time you updated your resumé, CV, or an online profile on a job site. What is the PRIMARY reason that you did so? |
| CurrencySymbol | Which currency do you use day-to-day? If your answer is complicated, please pick the one you’re most comfortable estimating in. |
| CurrencyDesc | Which currency do you use day-to-day? If your answer is complicated, please pick the one you’re most comfortable estimating in. |
| CompTotal | What is your current total compensation (salary, bonuses, and perks, before taxes and deductions), in CurrencySymbol? Please enter a whole number in the box below, without any punctuation. If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty. |
| CompFreq | Is that compensation weekly, monthly, or yearly? |
| ConvertedComp | Salary converted to annual USD salaries using the exchange rate on 2019-02-01, assuming 12 working months and 50 working weeks. |
| WorkWeekHrs | On average, how many hours per week do you work? |
| WorkPlan | How structured or planned is your work? |
| WorkChallenge | Of these options, what are your greatest challenges to productivity as a developer? Select up to 3: |
| WorkRemote | How often do you work remotely? |
| WorkLoc | Where would you prefer to work? |
| ImpSyn | For the specific work you do, and the years of experience you have, how do you rate your own level of competence? |
| CodeRev | Do you review code as part of your work? |
| CodeRevHrs | On average, how many hours per week do you spend on code review? |
| UnitTests | Does your company regularly employ unit tests in the development of their products? |
| PurchaseHow | How does your company make decisions about purchasing new technology (cloud, AI, IoT, databases)? |
| PurchaseWhat | What level of influence do you, personally, have over new technology purchases at your organization? |
| LanguageWorkedWith | Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.) |
| LanguageDesireNextYear | Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.) |
| DatabaseWorkedWith | Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.) |
| DatabaseDesireNextYear | Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.) |
| PlatformWorkedWith | Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.) |
| PlatformDesireNextYear | Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.) |
| WebFrameWorkedWith | Which of the following web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.) |
| WebFrameDesireNextYear | Which of the following web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.) |
| MiscTechWorkedWith | Which of the following other frameworks, libraries, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the technology and want to continue to do so, please check both boxes in that row.) |
| MiscTechDesireNextYear | Which of the following other frameworks, libraries, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the technology and want to continue to do so, please check both boxes in that row.) |
| DevEnviron | Which development environment(s) do you use regularly? Please check all that apply. |
| OpSys | What is the primary operating system in which you work? |
| Containers | How do you use containers (Docker, Open Container Initiative (OCI), etc.)? |
| BlockchainOrg | How is your organization thinking about or implementing blockchain technology? |
| BlockchainIs | Blockchain / cryptocurrency technology is primarily: |
| BetterLife | Do you think people born today will have a better life than their parents? |
| ITperson | Are you the “IT support person” for your family? |
| OffOn | Have you tried turning it off and on again? |
| SocialMedia | What social media site do you use the most? |
| Extraversion | Do you prefer online chat or IRL conversations? |
| ScreenName | What do you call it? |
| SOVisit1st | To the best of your memory, when did you first visit Stack Overflow? |
| SOVisitFreq | How frequently would you say you visit Stack Overflow? |
| SOVisitTo | I visit Stack Overflow to… (check all that apply) |
| SOFindAnswer | On average, how many times a week do you find (and use) an answer on Stack Overflow? |
| SOTimeSaved | Think back to the last time you solved a coding problem using Stack Overflow, as well as the last time you solved a problem using a different resource. Which was faster? |
| SOHowMuchTime | About how much time did you save? If you’re not sure, please use your best estimate. |
| SOAccount | Do you have a Stack Overflow account? |
| SOPartFreq | How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions. |
| SOJobs | Have you ever used or visited Stack Overflow Jobs? |
| EntTeams | Have you ever used Stack Overflow for Enterprise or Stack Overflow for Teams? |
| SOComm | Do you consider yourself a member of the Stack Overflow community? |
| WelcomeChange | Compared to last year, how welcome do you feel on Stack Overflow? |
| SONewContent | Would you like to see any of the following on Stack Overflow? Check all that apply. |
| Age | What is your age (in years)? If you prefer not to answer, you may leave this question blank. |
| Gender | Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank. |
| Trans | Do you identify as transgender? |
| Sexuality | Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank. |
| Ethnicity | Which of the following do you identify as? Please check all that apply. If you prefer not to answer, you may leave this question blank. |
| Dependents | Do you have any dependents (e.g., children, elders, or others) that you care for? |
| SurveyLength | How do you feel about the length of the survey this year? |
| SurveyEase | How easy or difficult was this survey to complete? |
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m1_survey_data.csv"
df = pd.read_csv(dataset_url)
df.head()
| | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | I am a developer by profession | No | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 22.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 1 | 9 | I am a developer by profession | Yes | Once a month or more often | The quality of OSS and closed source software ... | Employed full-time | New Zealand | No | Some college/university study without earning ... | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | 23.0 | Man | No | Bisexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 2 | 13 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Somewhat more welcome now than last year | Tech articles written by other developers;Cour... | 28.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Easy |
| 3 | 16 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | United Kingdom | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | NaN | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 26.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 4 | 17 | I am a developer by profession | Yes | Less than once a month but more than once per ... | The quality of OSS and closed source software ... | Employed full-time | Australia | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 29.0 | Man | No | Straight / Heterosexual | Hispanic or Latino/Latina;Multiracial | No | Appropriate in length | Easy |
5 rows × 85 columns
# rows that repeat an earlier row across all columns
df[df.duplicated()]
| | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1168 | 2339 | I am a developer by profession | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Some college/university study without earning ... | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | 24.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 1169 | 2342 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | United Kingdom | No | Some college/university study without earning ... | Information systems, information technology, o... | ... | Somewhat more welcome now than last year | Tech meetups or events in your area;Courses on... | 24.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Too long | Easy |
| 1170 | 2343 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of LOWER quality than prop... | Employed full-time | Canada | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Somewhat more welcome now than last year | Tech articles written by other developers;Indu... | 27.0 | Man | No | Straight / Heterosexual | Black or of African descent;White or of Europe... | No | Appropriate in length | Neither easy nor difficult |
| 1171 | 2344 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 24.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 1172 | 2347 | I am a developer by profession | Yes | Never | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United Kingdom | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | NaN | Woman | No | Straight / Heterosexual | Biracial | No | Too long | Easy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2297 | 4674 | I am not primarily a developer, but I write co... | Yes | Less than once per year | The quality of OSS and closed source software ... | Employed full-time | Bangladesh | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | NaN | ... | Somewhat less welcome now than last year | Tech articles written by other developers;Indu... | 31.0 | Man | No | Bisexual;Gay or Lesbian;Straight / Heterosexual | Black or of African descent;Hispanic or Latino... | Yes | Too long | Neither easy nor difficult |
| 2298 | 4675 | I am a developer by profession | Yes | Never | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Information systems, information technology, o... | ... | Just as welcome now as I felt last year | Tech meetups or events in your area | 27.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 2299 | 4676 | I am a developer by profession | Yes | Never | OSS is, on average, of HIGHER quality than pro... | Employed full-time | Finland | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Another engineering discipline (ex. civil, ele... | ... | Somewhat less welcome now than last year | NaN | 36.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Too long | Easy |
| 2300 | 4677 | I am a developer by profession | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United Kingdom | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | A natural science (ex. biology, chemistry, phy... | ... | Just as welcome now as I felt last year | NaN | 40.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Easy |
| 2301 | 4679 | I am a developer by profession | Yes | Less than once a month but more than once per ... | The quality of OSS and closed source software ... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | 27.0 | Man | No | NaN | White or of European descent | No | Appropriate in length | Easy |
154 rows × 85 columns
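`DataFrame.duplicated()` marks every row that repeats an earlier one (the first occurrence is kept unmarked by default), so the 154 rows above are the repeats only. A minimal sketch on a hypothetical toy frame (the values are made up; the real survey has 85 columns) shows the mask, the equivalent `drop_duplicates()` call, and a check restricted to one column:

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
toy = pd.DataFrame({
    "Respondent": [4, 9, 9, 13, 4],
    "Country": ["United States", "New Zealand", "New Zealand",
                "United States", "United States"],
})

# True for each row that repeats an earlier row across all columns.
dup_mask = toy.duplicated()
print("duplicate rows:", int(dup_mask.sum()))

# drop_duplicates() keeps the first occurrence of each row,
# which matches filtering with the inverted mask.
deduped = toy.drop_duplicates()
print("rows after removing duplicates:", len(deduped))

# Restrict the comparison to one column, e.g. the respondent ID.
id_dups = toy.duplicated(subset="Respondent")
print("repeated respondent IDs:", int(id_dups.sum()))
```

Checking on `Respondent` alone can be a useful cross-check, since two rows with the same ID but different answers would not show up in the all-columns test.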
# rows that remain after excluding duplicates (the first occurrence of each row is kept)
df[~df.duplicated()]
| | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | I am a developer by profession | No | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 22.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 1 | 9 | I am a developer by profession | Yes | Once a month or more often | The quality of OSS and closed source software ... | Employed full-time | New Zealand | No | Some college/university study without earning ... | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | 23.0 | Man | No | Bisexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 2 | 13 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Somewhat more welcome now than last year | Tech articles written by other developers;Cour... | 28.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Easy |
| 3 | 16 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | United Kingdom | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | NaN | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 26.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 4 | 17 | I am a developer by profession | Yes | Less than once a month but more than once per ... | The quality of OSS and closed source software ... | Employed full-time | Australia | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 29.0 | Man | No | Straight / Heterosexual | Hispanic or Latino/Latina;Multiracial | No | Appropriate in length | Easy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11547 | 25136 | I am a developer by profession | Yes | Never | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Cour... | 36.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Difficult |
| 11548 | 25137 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | Poland | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | A lot more welcome now than last year | Tech articles written by other developers;Tech... | 25.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 11549 | 25138 | I am a developer by profession | Yes | Less than once per year | The quality of OSS and closed source software ... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | A lot more welcome now than last year | Tech articles written by other developers;Indu... | 34.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Too long | Easy |
| 11550 | 25141 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of LOWER quality than prop... | Employed full-time | Switzerland | No | Secondary school (e.g. American high school, G... | NaN | ... | Somewhat less welcome now than last year | NaN | 25.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 11551 | 25142 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United Kingdom | No | Other doctoral degree (Ph.D, Ed.D., etc.) | A natural science (ex. biology, chemistry, phy... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Tech... | 30.0 | Man | No | Bisexual | White or of European descent | No | Appropriate in length | Easy |
11398 rows × 85 columns
df = df.drop_duplicates()
df
| | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | I am a developer by profession | No | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 22.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 1 | 9 | I am a developer by profession | Yes | Once a month or more often | The quality of OSS and closed source software ... | Employed full-time | New Zealand | No | Some college/university study without earning ... | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | NaN | 23.0 | Man | No | Bisexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 2 | 13 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Somewhat more welcome now than last year | Tech articles written by other developers;Cour... | 28.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Easy |
| 3 | 16 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | United Kingdom | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | NaN | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 26.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 4 | 17 | I am a developer by profession | Yes | Less than once a month but more than once per ... | The quality of OSS and closed source software ... | Employed full-time | Australia | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 29.0 | Man | No | Straight / Heterosexual | Hispanic or Latino/Latina;Multiracial | No | Appropriate in length | Easy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11547 | 25136 | I am a developer by profession | Yes | Never | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Cour... | 36.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Difficult |
| 11548 | 25137 | I am a developer by profession | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | Poland | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | A lot more welcome now than last year | Tech articles written by other developers;Tech... | 25.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Neither easy nor difficult |
| 11549 | 25138 | I am a developer by profession | Yes | Less than once per year | The quality of OSS and closed source software ... | Employed full-time | United States | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | ... | A lot more welcome now than last year | Tech articles written by other developers;Indu... | 34.0 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Too long | Easy |
| 11550 | 25141 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of LOWER quality than prop... | Employed full-time | Switzerland | No | Secondary school (e.g. American high school, G... | NaN | ... | Somewhat less welcome now than last year | NaN | 25.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 11551 | 25142 | I am a developer by profession | Yes | Less than once a month but more than once per ... | OSS is, on average, of HIGHER quality than pro... | Employed full-time | United Kingdom | No | Other doctoral degree (Ph.D, Ed.D., etc.) | A natural science (ex. biology, chemistry, phy... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Tech... | 30.0 | Man | No | Bisexual | White or of European descent | No | Appropriate in length | Easy |
11398 rows × 85 columns
# Evaluating for Missing Data using either .isnull() or .notnull()
df['WorkLoc'].isnull().value_counts()
False    11366
True        32
Name: WorkLoc, dtype: int64
df['ConvertedComp'].isnull().value_counts()
False    10582
True       816
Name: ConvertedComp, dtype: int64
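The per-column checks above can be extended to the whole frame in one call. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the survey file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "WorkLoc": ["Office", None, "Home", "Office"],
    "ConvertedComp": [61000.0, np.nan, np.nan, 90000.0],
    "Age": [22.0, 23.0, 28.0, 26.0],
})

# isnull() marks each missing cell; sum() counts them per column
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Sorting the counts makes it easier to see which columns need imputation first.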
When to impute with mode
# calculate the most common value in the 'WorkLoc' column
mcv = df['WorkLoc'].value_counts().idxmax()
# replace the missing 'WorkLoc' values with the most frequent value
# (assigning the result back avoids relying on an in-place edit of a column selection)
df["WorkLoc"] = df["WorkLoc"].fillna(mcv)
# Verify if imputing was successful
df['WorkLoc'].isnull().value_counts()
False    11398
Name: WorkLoc, dtype: int64
When to impute with median
# calculate median compensation
medcomp = df['ConvertedComp'].median()
# replace the missing 'ConvertedComp' values with the median
# (assigning the result back avoids relying on an in-place edit of a column selection)
df["ConvertedComp"] = df["ConvertedComp"].fillna(medcomp)
# Verify if imputing was successful
df['ConvertedComp'].isnull().value_counts()
False    11398
Name: ConvertedComp, dtype: int64
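As a rough rule of thumb, the mode suits categorical columns (there is no meaningful average of labels), while the median suits skewed numeric columns because a few extreme values pull the mean away from the typical case. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column, one skewed numeric column
toy = pd.DataFrame({
    "WorkLoc": ["Office", "Office", "Home", None, "Office"],
    "ConvertedComp": [50000.0, 60000.0, 70000.0, np.nan, 2000000.0],
})

# Categorical column: fill with the most frequent label (the mode)
toy["WorkLoc"] = toy["WorkLoc"].fillna(toy["WorkLoc"].mode()[0])

# Skewed numeric column: compare mean and median before filling
mean_comp = toy["ConvertedComp"].mean()      # pulled up by the 2,000,000 entry
median_comp = toy["ConvertedComp"].median()  # stays near the typical salaries
toy["ConvertedComp"] = toy["ConvertedComp"].fillna(median_comp)

print(mean_comp, median_comp)
```

Here the single very large salary drags the mean far above the other values, which is why the median is the safer fill value for this column.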
There are two columns in the dataset that describe compensation. One is "CompFreq", which shows how often a developer is paid (Yearly, Monthly, Weekly). The other is "CompTotal", which gives how much the developer is paid per year, month, or week, depending on their "CompFreq". Because the pay periods differ, the raw "CompTotal" values cannot be compared directly across developers.
Create a new column called 'NormalizedAnnualCompensation' which contains the 'Annual Compensation' irrespective of the 'CompFreq'.
Once this column is ready, it makes comparison of salaries easy.
df['CompFreq'].value_counts()
Yearly     6073
Monthly    4788
Weekly      331
Name: CompFreq, dtype: int64
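def calculate_value(freq):
    # Return the number of pay periods per year for a 'CompFreq' value
    if freq == 'Yearly':
        return 1
    elif freq == 'Monthly':
        return 12
    elif freq == 'Weekly':
        return 52
    else:
        return None
# Map each 'CompFreq' value to its number of pay periods per year
df['CompFreqVal'] = df['CompFreq'].apply(calculate_value)
# Annual compensation = pay per period * pay periods per year
df['NormalizedAnnualCompensation'] = df['CompTotal']*df['CompFreqVal']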
df['NormalizedAnnualCompensation'].head()
0     61000.0
1    138000.0
2     90000.0
3    348000.0
4     90000.0
Name: NormalizedAnnualCompensation, dtype: float64
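The same normalization can also be written without a row-wise function by mapping the frequency labels through a dictionary. This is a sketch on hypothetical toy values; the name `periods_per_year` is introduced here for illustration and does not appear in the notebook:

```python
import pandas as pd

# Hypothetical toy rows; the last one has no pay frequency recorded
toy = pd.DataFrame({
    "CompFreq": ["Yearly", "Monthly", "Weekly", None],
    "CompTotal": [61000.0, 11500.0, 1000.0, 5000.0],
})

# Pay periods per year for each frequency label (same factors as calculate_value)
periods_per_year = {"Yearly": 1, "Monthly": 12, "Weekly": 52}

# map() returns NaN for labels not in the dictionary, so unknown
# frequencies yield a missing annual value rather than a wrong one
toy["NormalizedAnnualCompensation"] = (
    toy["CompTotal"] * toy["CompFreq"].map(periods_per_year)
)
print(toy)
```

Keeping the factors in one dictionary makes it easy to see, and change, the assumption of 52 paid weeks per year.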
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")
The column ConvertedComp contains Salary converted to annual USD salaries using the exchange rate on 2019-02-01.
This assumes 12 working months and 50 working weeks.
Plot the distribution curve for the column ConvertedComp
# sns.distplot(df ['ConvertedComp'].dropna(),hist=False)
# df['ConvertedComp'].fillna(df['ConvertedComp'].mean(), inplace=True)
# sns.distplot (df['ConvertedComp'], hist = False)
sns.displot(df['ConvertedComp'], kind='kde')
<seaborn.axisgrid.FacetGrid at 0x28558379350>
Plot the histogram for the column ConvertedComp
count, bin_edges = np.histogram(df['ConvertedComp'].dropna())
df['ConvertedComp'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)
plt.title('Histogram of salary converted to annual USD') # add a title to the histogram
plt.ylabel('Number of respondents') # add y-label
plt.xlabel('Annual salary (USD)') # add x-label
plt.show()
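The histogram cells call `np.histogram` only to obtain tick positions that line up with the bars. A small sketch of what that call returns, using made-up values rather than the survey column:

```python
import numpy as np

# Hypothetical values standing in for a numeric survey column
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 9.0, 10.0])

# With no 'bins' argument, np.histogram uses 10 equal-width bins
# spanning the minimum to the maximum of the data
count, bin_edges = np.histogram(values)

print(count)      # number of values falling in each bin
print(bin_edges)  # 11 edges delimiting the 10 bins
```

Passing `bin_edges` as `xticks` therefore places one tick at every bar boundary, since pandas' `plot(kind='hist')` also defaults to 10 bins.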
Give the five-number summary for the column Age.
df['Age'].describe()
count    11111.000000
mean        30.778895
std          7.393686
min         16.000000
25%         25.000000
50%         29.000000
75%         35.000000
max         99.000000
Name: Age, dtype: float64
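Note that `describe()` also reports count, mean and std, which are not part of the five-number summary (minimum, Q1, median, Q3, maximum). A minimal sketch that extracts only those five values, using a hypothetical list of ages:

```python
import pandas as pd

# Hypothetical ages, not the survey data
ages = pd.Series([16, 22, 25, 29, 29, 35, 40, 99], dtype="float64")

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are exactly the five-number summary
five_number = ages.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_number.index = ["min", "Q1", "median", "Q3", "max"]
print(five_number)
```

pandas interpolates linearly between neighbouring values by default, so Q1 and Q3 need not be values that occur in the data.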
Plot a histogram of the column Age.
count, bin_edges = np.histogram(df['Age'].dropna())
df['Age'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)
plt.title('Histogram of Age') # add a title to the histogram
plt.ylabel('Count') # add y-label
plt.xlabel('Age') # add x-label
plt.show()
Find out if outliers exist in the column ConvertedComp using a box plot
df['ConvertedComp'].plot(kind='box', figsize=(15,7))
plt.title('Box plot of Salary in USD')
plt.ylabel('Salary in USD')
plt.show()
Based on the boxplot of ‘Age’, how many outliers do you see below Q1?
Ans: Zero
Find out the interquartile range (IQR) for the column ConvertedComp.
Q1 = df['ConvertedComp'].quantile(0.25)
Q3 = df['ConvertedComp'].quantile(0.75)
IQR = Q3 - Q1
IQR
73132.0
Find out the upper and lower bounds.
# Note: these are the observed maximum and minimum of the column, not the 1.5 * IQR fences
print("Upper Bound: {}".format(df['ConvertedComp'].max()))
print("Lower Bound: {}".format(df['ConvertedComp'].min()))
Upper Bound: 2000000.0
Lower Bound: 0.0
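In an IQR-based outlier check, "upper and lower bounds" usually means the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR (Tukey's rule) rather than the column's maximum and minimum. A sketch of that calculation on hypothetical salaries (the variable names and values are illustrative only):

```python
import pandas as pd

# Hypothetical salaries with one extreme value
comp = pd.Series([20000.0, 40000.0, 50000.0, 60000.0,
                  80000.0, 100000.0, 2000000.0])

q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
iqr = q3 - q1

# Tukey fences: values outside these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

n_outliers = ((comp < lower_fence) | (comp > upper_fence)).sum()
print(lower_fence, upper_fence, n_outliers)
```

A negative lower fence, as in this toy example, simply means no salary can be flagged as a low outlier.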
Identify how many outliers are there in the ConvertedComp column.
((df['ConvertedComp'] < (Q1 - 1.5 * IQR)) | (df['ConvertedComp'] > (Q3 + 1.5 * IQR))).sum()
879
Create a new dataframe by removing the outliers from the ConvertedComp column.
# Flag rows whose ConvertedComp lies outside the 1.5 * IQR fences
mask = (df['ConvertedComp'] < (Q1 - 1.5 * IQR)) | (df['ConvertedComp'] > (Q3 + 1.5 * IQR))
# Keep only the non-outlier rows in a new dataframe; df itself is left unchanged
df_no_outliers = df[~mask]
df_no_outliers['ConvertedComp'].mean()
59883.20838915799
Find the correlation between Age and all other numerical columns.
# Correlate on the outlier-free data, restricted to numeric columns
df_no_outliers.corr(numeric_only=True)
| | Respondent | CompTotal | ConvertedComp | WorkWeekHrs | CodeRevHrs | Age |
|---|---|---|---|---|---|---|
| Respondent | 1.000000 | -0.019364 | 0.010878 | -0.015275 | 0.002980 | 0.003950 |
| CompTotal | -0.019364 | 1.000000 | -0.063561 | 0.004975 | 0.017536 | 0.006371 |
| ConvertedComp | 0.010878 | -0.063561 | 1.000000 | 0.034351 | -0.088934 | 0.401821 |
| WorkWeekHrs | -0.015275 | 0.004975 | 0.034351 | 1.000000 | 0.031963 | 0.037452 |
| CodeRevHrs | 0.002980 | 0.017536 | -0.088934 | 0.031963 | 1.000000 | -0.017961 |
| Age | 0.003950 | 0.006371 | 0.401821 | 0.037452 | -0.017961 | 1.000000 |
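Since the question concerns Age only, it can be easier to read a single column of the correlation matrix. A minimal sketch on a hypothetical toy frame (values invented; column names borrowed from the survey for readability):

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns besides Age and one text column
toy = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "ConvertedComp": [40000, 52000, 61000, 75000, 90000],
    "WorkWeekHrs": [40, 45, 40, 38, 40],
    "Country": ["US", "NZ", "US", "UK", "AU"],
})

# numeric_only=True skips the text column; selecting "Age" keeps one column
# of the matrix, and dropping "Age" removes its trivial self-correlation of 1
age_corr = (
    toy.corr(numeric_only=True)["Age"]
    .drop("Age")
    .sort_values(ascending=False)
)
print(age_corr)
```

Sorting puts the variable most positively correlated with Age at the top of the output.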
Visualizing the distribution of data
Visualizing relationship
Visualizing composition of data
Visualizing comparison of data
After data has been collected, cleaned and organized, the work of interpretation begins. You are now able to obtain a complete view of the data and hopefully answer the questions that were formed before starting the analysis. Now you typically compose a findings report that explains what was learned. Depending on the audience, the report can be
As a data scientist, you are expected to do thorough analysis with the appropriate data, deploying the appropriate tools. As a writer, you are responsible for communicating your findings to the readers. Transport Policy, a leading research publication in transportation planning, offers a checklist for authors interested in publishing with the journal. The checklist is a series of questions authors are expected to consider before submitting their manuscripts to the journal. I believe the checklist is useful for budding data scientists and, therefore, I have reproduced it verbatim for their benefit.
Have you told readers, at the outset, what they might gain by reading your paper?
Have you made the aim of your work clear?
Have you explained the significance of your contribution?
Have you set your work in the appropriate context by giving sufficient background (including a complete set of relevant references) to your work?
Have you addressed the question of practicality and usefulness?
Have you identified future developments that might result from your work?
Have you structured your paper in a clear and logical fashion?
Some items that seem interesting to the analyst may not be relevant to the project. Trying to explain every little detail to your audience, without recognizing which data is irrelevant, could damage the key message.