Exploratory analysis on suicide data

Suicide analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Reading the data

df = pd.read_csv('master.csv')

Let’s take a look at the data:

df.sample(5)

	country	year	sex	age	suicides_no	population	suicides/100k pop	country-year	HDI for year	gdp_for_year ($)	gdp_per_capita ($)	generation
12171	Ireland	1994	female	15-24 years	13	303000	4.29	Ireland1994	NaN	57,166,037,102	17188	Generation X
11049	Guatemala	2014	female	5-14 years	13	1906536	0.68	Guatemala2014	0.627	58,722,323,918	4210	Generation Z
23290	South Africa	1996	male	35-54 years	36	4031071	0.89	South Africa1996	NaN	147,607,982,695	3908	Boomers
7902	Ecuador	2002	male	55-74 years	39	525419	7.42	Ecuador2002	NaN	28,548,945,000	2472	Silent
5936	Colombia	2010	male	75+ years	65	418296	15.54	Colombia2010	0.706	287,018,184,638	6836	Silent

df.describe()

	year	suicides_no	population	suicides/100k pop	HDI for year	gdp_per_capita ($)
count	27820.000000	27820.000000	2.782000e+04	27820.000000	8364.000000	27820.000000
mean	2001.258375	242.574407	1.844794e+06	12.816097	0.776601	16866.464414
std	8.469055	902.047917	3.911779e+06	18.961511	0.093367	18887.576472
min	1985.000000	0.000000	2.780000e+02	0.000000	0.483000	251.000000
25%	1995.000000	3.000000	9.749850e+04	0.920000	0.713000	3447.000000
50%	2002.000000	25.000000	4.301500e+05	5.990000	0.779000	9372.000000
75%	2008.000000	131.000000	1.486143e+06	16.620000	0.855000	24874.000000
max	2016.000000	22338.000000	4.380521e+07	224.970000	0.944000	126352.000000

The data set covers suicides recorded from 1985 to 2016.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
country               27820 non-null object
year                  27820 non-null int64
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB

Is there any null data?

df.isnull().sum()

country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64

Understanding the data

The country-year field simply concatenates the country name and the year of the record, so it is redundant and will be discarded. The ‘HDI for year’ field is missing in most records (19,456 of 27,820), so it will be discarded as well.

df.drop(['country-year', 'HDI for year'], inplace=True, axis = 1)
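As a quick check on that decision, the share of missing HDI values can be worked out from the counts printed above. This is a minimal sketch that uses only those two numbers, not the data frame itself:

```python
# Counts taken from the df.info() and df.isnull().sum() output above.
total_rows = 27820
hdi_missing = 19456

# Fraction of records with no HDI value.
missing_share = hdi_missing / total_rows
print(f"'HDI for year' is missing in {missing_share:.1%} of the rows")
```

On the real data frame, df.isnull().mean() should give the same kind of per-column share directly.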

Let’s rename some columns simply to make it easier to access them.

df = df.rename(columns={'gdp_per_capita ($)': 'gdp_per_capita', ' gdp_for_year ($) ':'gdp_for_year'})

The ‘gdp_for_year’ field is stored as a string with thousands separators, so let’s convert it to a number.

# Strip the thousands separators and cast the whole column at once
# (avoids chained assignment on df['gdp_for_year'][i]).
df['gdp_for_year'] = df['gdp_for_year'].str.replace(',', '', regex=False).astype('int64')
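To see the conversion in isolation, here is a small sketch on three GDP strings copied from the sample rows shown earlier. The variable names are only illustrative:

```python
import pandas as pd

# GDP values as they appear in the raw file (strings with commas).
gdp_text = pd.Series(['57,166,037,102', '58,722,323,918', '28,548,945,000'])

# Remove the commas as literal characters, then cast to 64-bit integers.
gdp_number = gdp_text.str.replace(',', '', regex=False).astype('int64')
print(gdp_number.tolist())
```

pd.read_csv also has a thousands=',' argument, which may make this step unnecessary if it is set when the file is loaded.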

Data Description

Each record in the data set represents one combination of year, country, age range, and sex. For example, in Brazil in 1985, 129 men aged 75 or over died by suicide.

The data set has 10 attributes:

Country: country of the record;
Year: year of the record;
Sex: sex (male or female);
Age: age range, divided into six categories;
Suicides_no: number of suicides;
Population: population of this sex, in this age range, in this country and in this year;
Suicides / 100k pop: ratio between the number of suicides and the population, per 100k inhabitants;
GDP_for_year: GDP of the country in the year in question;
GDP_per_capita: ratio between the country’s GDP and its population;
Generation: generation of the group in question, with six possible categories.
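The suicides/100k pop attribute can be checked by hand. The sketch below recomputes it for the five rows shown in the df.sample(5) output and compares the result with the reported value. The helper name rate_per_100k is my own, not part of the data set:

```python
# (country, suicides_no, population, reported suicides/100k pop),
# copied from the df.sample(5) output shown earlier.
rows = [
    ('Ireland', 13, 303000, 4.29),
    ('Guatemala', 13, 1906536, 0.68),
    ('South Africa', 36, 4031071, 0.89),
    ('Ecuador', 39, 525419, 7.42),
    ('Colombia', 65, 418296, 15.54),
]

def rate_per_100k(suicides, population):
    """Suicides per 100,000 inhabitants."""
    return suicides / population * 100_000

for country, suicides, population, reported in rows:
    computed = round(rate_per_100k(suicides, population), 2)
    print(country, computed, reported)
```

For these rows the computed and reported values agree to two decimal places.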

Possible age categories and generations are:

df['age'].unique()

array(['15-24 years', '35-54 years', '75+ years', '25-34 years',
       '55-74 years', '5-14 years'], dtype=object)

df['generation'].unique()

array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

Adding some things

Since the HDI was discarded, and it is still interesting to assess whether a country’s development influences its suicide rate, I took a list of first and second world countries from this site:

http://worldpopulationreview.com

Then I categorized each country in the data set into first, second and third world.

# Country names must match the spelling used in the data set (e.g. 'Republic of Korea').
first_world = ['United States', 'Germany', 'Japan', 'Turkey', 'United Kingdom', 'France', 'Italy', 'Republic of Korea',
               'Spain', 'Canada', 'Australia', 'Netherlands', 'Belgium', 'Greece', 'Portugal',
               'Sweden', 'Austria', 'Switzerland', 'Israel', 'Singapore', 'Denmark', 'Finland', 'Norway', 'Ireland',
               'New Zealand', 'Slovenia', 'Estonia', 'Cyprus', 'Luxembourg', 'Iceland']

second_world = ['Russian Federation', 'Ukraine', 'Poland', 'Uzbekistan', 'Romania', 'Kazakhstan', 'Azerbaijan', 'Czech Republic',
                'Hungary', 'Belarus', 'Tajikistan', 'Serbia', 'Bulgaria', 'Slovakia', 'Croatia', 'Moldova', 'Georgia',
                'Bosnia And Herzegovina', 'Albania', 'Armenia', 'Lithuania', 'Latvia', 'Brazil', 'Chile', 'Argentina',
                'China', 'India', 'Bolivia']

# Label each row '1', '2' or '3' according to the group its country belongs to.
def classify_world(country):
    if country in first_world:
        return '1'
    if country in second_world:
        return '2'
    return '3'

df['country_world'] = df['country'].map(classify_world)
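Because countries are matched by exact string, a misspelled or differently named country silently ends up in the third group. A small sanity check can list the names that never match. In this sketch both collections are illustrative stand-ins. In the notebook, dataset_countries would be set(df['country'].unique()) and listed would be the first and second world lists:

```python
# Illustrative stand-ins, not the full data.
dataset_countries = {'Republic of Korea', 'New Zealand', 'Brazil', 'Ireland'}
listed = ['South Korea', 'New Zeland', 'Brazil', 'Ireland']

# Names that are in a list but never occur in the data cannot match any row.
unmatched = sorted(set(listed) - dataset_countries)
print(unmatched)
```

Any name printed here needs to be corrected or renamed before the categorization is trusted.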

Exploratory analysis

I will analyze the impact of some attributes on the number of suicides, starting with the year.

Year

suicides_no_year = []

for y in df['year'].unique():
    suicides_no_year.append(sum(df[df['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)
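The loop-and-append pattern used for these totals can also be written as a single groupby. The sketch below shows the idea on a made-up frame that only borrows the column names from the data set. It is an alternative formulation, not the code that produced the plot above:

```python
import pandas as pd

# Toy frame with the same column names as the data set (values are made up).
toy = pd.DataFrame({
    'year': [1985, 1985, 1986, 1986, 1986],
    'suicides_no': [10, 5, 7, 3, 2],
})

# One row per year with the summed count, highest total first.
per_year = (toy.groupby('year', as_index=False)['suicides_no']
               .sum()
               .sort_values('suicides_no', ascending=False))
print(per_year)
```

The same pattern would apply to the age, sex, country and generation totals by changing the grouping column.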

Age

suicides_no_age = []

for a in df['age'].unique():
    suicides_no_age.append(sum(df[df['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df['age'].unique(), y = suicides_no_age)

Sex

suicides_no_sex = []

for s in df['sex'].unique():
    suicides_no_sex.append(sum(df[df['sex'] == s]['suicides_no']))

sns.barplot(x = df['sex'].unique(), y = suicides_no_sex)

sns.catplot(x='sex', y='suicides_no',col='age', data=df, estimator=np.median,height=4, aspect=.7,kind='bar')

Country

Countries with larger populations are expected to record more suicides in absolute terms.

suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides_no']))

n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['suicides_no_pais'][:quant]
sns.barplot(x = top_suicides, y = top_paises)

Using the number of suicides per 100k inhabitants removes the bias toward countries with large populations.

suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides/100k pop']))

n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais/100k'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais/100k', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais/100k', ascending=False)['suicides_no_pais/100k'][:quant]
sns.barplot(x = top_suicides, y = top_paises)
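Note that the code above sums the per-row rates, so the result also depends on how many years and age groups each country reports. One possible alternative, sketched here on made-up numbers for two hypothetical countries, is to divide total suicides by total population:

```python
import pandas as pd

# Made-up rows for two hypothetical countries, same column names as the data set.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'suicides_no': [10, 30, 5, 5],
    'population': [100_000, 100_000, 50_000, 50_000],
})

# Sum counts and population per country before taking the ratio.
totals = toy.groupby('country')[['suicides_no', 'population']].sum()

# Aggregate rate: total suicides over total population, per 100k inhabitants.
totals['rate_per_100k'] = totals['suicides_no'] / totals['population'] * 100_000
print(totals.sort_values('rate_per_100k', ascending=False))
```

When rows from several years are pooled like this, the population column counts the same people once per year, so the figure reads as an average yearly rate.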

Generation

suicides_no_gen = []
for g in df['generation'].unique():
    suicides_no_gen.append(sum(df[df['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df['generation'].unique(), y = suicides_no_gen)

Country world

suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides_no']))

sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides/100k pop']))

sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

GDP for year

sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df)

GDP per capita

sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df)

Attribute Correlation

plt.figure(figsize=(8,7))
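# Correlation is only defined for numeric columns, so select them explicitly.
sns.heatmap(df.select_dtypes(include='number').corr(), cmap = 'coolwarm', annot=True)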

Checking the suicides/100k distribution of some countries

countries = ['Russian Federation', 'Brazil', 'Poland', 'Italy', 'United States', 'Germany', 'Japan', 'Spain', 'France']
df_filtred = df[df['country'].isin(countries)]

plt.figure(figsize=(12,6))
sns.boxplot(x = 'suicides/100k pop', y = 'country', data = df_filtred)

General Plot of the World

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

cod = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')

codes = []
for i in range(len(n_suicides_pais)):
    c = n_suicides_pais['country'][i]
    f = 0
    for j in range(len(cod)):
        if c == cod['COUNTRY'][j]:
            tmp = cod['CODE'][j]
            f = 1
            break
    if f == 0:
        if c == 'Bahamas':
            tmp  = 'BHM'
        elif c == 'Republic of Korea':
            tmp = 'KOR'
        elif c == 'Russian Federation':
            tmp = 'RUS'
        else:
            tmp = 'VC'
    codes.append(tmp)
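The nested loop above is a name-to-code lookup with a few manual exceptions, which can also be expressed with dictionaries. In this sketch code_by_name is a hypothetical two-entry table. In the notebook it could be built with dict(zip(cod['COUNTRY'], cod['CODE'])). The function name lookup_code and the country 'Atlantis' are illustrative only:

```python
# Hypothetical name -> code table standing in for the downloaded CSV.
code_by_name = {'Brazil': 'BRA', 'Japan': 'JPN'}

# Manual overrides for names spelled differently in the two sources
# (the same three cases handled in the loop above).
overrides = {'Bahamas': 'BHM', 'Republic of Korea': 'KOR', 'Russian Federation': 'RUS'}

def lookup_code(name, default='VC'):
    """Return the code from the table, then the overrides, then the default."""
    return code_by_name.get(name, overrides.get(name, default))

codes = [lookup_code(n) for n in ['Brazil', 'Republic of Korea', 'Atlantis']]
print(codes)
```

Keeping the exceptions in one dictionary makes it easier to see which countries still fall back to the default code.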

data = dict(
        type = 'choropleth',
        locations = codes,
        z = n_suicides_pais['suicides_no_pais/100k'],
        text = n_suicides_pais['country'],
        colorbar = {'title' : 'number of suicides'},
      )

layout = dict(
    title = 'Suicide heat map 1985-2016',
    geo = dict(
        showframe = False,
        projection = {'type':'equirectangular'}
    )
)

choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

Brazil Facts

As a Brazilian, I have a particular interest in the suicide rate in Brazil, so I will look at the figures for this country specifically.

df_brasil = df[df['country'] == 'Brazil']

The country and country_world fields are the same for every row, so they are discarded.

# Reassign instead of dropping in place, since df_brasil is a slice of df.
df_brasil = df_brasil.drop(['country', 'country_world'], axis = 1)

I’m going to repeat many of the plots already made, now for Brazil only.

suicides_no_year = []

for y in df_brasil['year'].unique():
    suicides_no_year.append(sum(df_brasil[df_brasil['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df_brasil['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)

suicides_no_age = []

for a in df_brasil['age'].unique():
    suicides_no_age.append(sum(df_brasil[df_brasil['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df_brasil['age'].unique(), y = suicides_no_age)

suicides_no_sex = []

for s in df_brasil['sex'].unique():
    suicides_no_sex.append(sum(df_brasil[df_brasil['sex'] == s]['suicides_no']))

sns.barplot(x = df_brasil['sex'].unique(), y = suicides_no_sex)

suicides_no_gen = []
for g in df_brasil['generation'].unique():
    suicides_no_gen.append(sum(df_brasil[df_brasil['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df_brasil['generation'].unique(), y = suicides_no_gen)

sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df_brasil)

sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df_brasil)

plt.figure(figsize=(8,7))
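# Correlation is only defined for numeric columns, so select them explicitly.
sns.heatmap(df_brasil.select_dtypes(include='number').corr(), cmap = 'coolwarm', annot=True)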

Luis Meazzini