Exploratory analysis on suicide data

Suicide analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Reading the data

df = pd.read_csv('master.csv')

Let’s take a look at the data:

df.sample(5)
country year sex age suicides_no population suicides/100k pop country-year HDI for year gdp_for_year ($) gdp_per_capita ($) generation
12171 Ireland 1994 female 15-24 years 13 303000 4.29 Ireland1994 NaN 57,166,037,102 17188 Generation X
11049 Guatemala 2014 female 5-14 years 13 1906536 0.68 Guatemala2014 0.627 58,722,323,918 4210 Generation Z
23290 South Africa 1996 male 35-54 years 36 4031071 0.89 South Africa1996 NaN 147,607,982,695 3908 Boomers
7902 Ecuador 2002 male 55-74 years 39 525419 7.42 Ecuador2002 NaN 28,548,945,000 2472 Silent
5936 Colombia 2010 male 75+ years 65 418296 15.54 Colombia2010 0.706 287,018,184,638 6836 Silent
df.describe()
year suicides_no population suicides/100k pop HDI for year gdp_per_capita ($)
count 27820.000000 27820.000000 2.782000e+04 27820.000000 8364.000000 27820.000000
mean 2001.258375 242.574407 1.844794e+06 12.816097 0.776601 16866.464414
std 8.469055 902.047917 3.911779e+06 18.961511 0.093367 18887.576472
min 1985.000000 0.000000 2.780000e+02 0.000000 0.483000 251.000000
25% 1995.000000 3.000000 9.749850e+04 0.920000 0.713000 3447.000000
50% 2002.000000 25.000000 4.301500e+05 5.990000 0.779000 9372.000000
75% 2008.000000 131.000000 1.486143e+06 16.620000 0.855000 24874.000000
max 2016.000000 22338.000000 4.380521e+07 224.970000 0.944000 126352.000000

The dataset covers suicides recorded from 1985 to 2016.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
country               27820 non-null object
year                  27820 non-null int64
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB

Is there null data?

df.isnull().sum()
country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64
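The raw counts are easier to judge as a share of all rows. Here is a minimal sketch on a toy frame (the `toy` data and the name `missing_share` are made up for illustration); on the real data, 19456 missing values out of 27820 rows is roughly 70% of 'HDI for year'.

```python
import pandas as pd

# Toy frame standing in for df: one complete column, one mostly empty.
toy = pd.DataFrame({
    "year": [1985, 1986, 1987, 1988],
    "HDI for year": [None, None, 0.7, None],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

Running `df.isnull().mean()` on the full data set should give the same kind of per-column summary.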

Understanding the data

The country-year field just concatenates the country name and the year of the record, so it is redundant and will be discarded. Because most values in the ‘HDI for year’ field are missing (19456 of 27820 rows), it will be discarded as well.

df.drop(['country-year', 'HDI for year'], inplace=True, axis = 1)

Let’s rename some columns simply to make it easier to access them.

df = df.rename(columns={'gdp_per_capita ($)': 'gdp_per_capita', ' gdp_for_year ($) ':'gdp_for_year'})

The ‘gdp_for_year’ field is stored as a string with thousands separators, so let’s convert it to a number.

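# Strip the thousands separators in one vectorized step, then cast to integer.
# (Assigning through df['gdp_for_year'][i] is chained indexing and may not update df.)
df['gdp_for_year'] = df['gdp_for_year'].str.replace(',', '', regex=False).astype('int64')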

Data Description

Each record in the data set represents a year, a country, an age range, and a sex. For example, in Brazil in 1985, 129 men over 75 years old died by suicide.

The data set has 10 attributes:

  • Country: country of the record;
  • Year: year of the record;
  • Sex: sex (male or female);
  • Age: age range of the victims, divided into six categories;
  • Suicides_no: number of suicides;
  • Population: population of this sex, in this age range, in this country and in this year;
  • Suicides / 100k pop: ratio of the number of suicides to the population, per 100k inhabitants;
  • GDP_for_year: GDP of the country in the given year;
  • GDP_per_capita: ratio between the country’s GDP and its population;
  • Generation: generation of the people in question, with six possible categories.
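To make the 'Suicides / 100k pop' definition concrete, here is a small check that recomputes it for the five rows printed by `df.sample(5)` above. The frame `rows` and the column `recomputed` are names I made up for this sketch.

```python
import pandas as pd

# Rows copied from the df.sample(5) output shown earlier.
rows = pd.DataFrame({
    "country": ["Ireland", "Guatemala", "South Africa", "Ecuador", "Colombia"],
    "suicides_no": [13, 13, 36, 39, 65],
    "population": [303000, 1906536, 4031071, 525419, 418296],
    "suicides/100k pop": [4.29, 0.68, 0.89, 7.42, 15.54],
})

# Rate per 100,000 inhabitants, rounded to two decimals as in the file.
rows["recomputed"] = (rows["suicides_no"] / rows["population"] * 100_000).round(2)
print(rows[["country", "suicides/100k pop", "recomputed"]])
```

For these rows the recomputed values agree with the stored column, which suggests the field is simply suicides_no / population × 100000.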

Possible age categories and generations are:

df['age'].unique()
array(['15-24 years', '35-54 years', '75+ years', '25-34 years',
       '55-74 years', '5-14 years'], dtype=object)
df['generation'].unique()
array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

Adding some things

The HDI was discarded, but it is still interesting to assess whether a country’s level of development influences its suicide rate. As a rough substitute, I took a list of first and second world countries from this site:

http://worldpopulationreview.com

Then I categorized each country in the data set into first, second and third world.

# Names must match the spelling used in the dataset's 'country' column.
first_world = ['United States', 'Germany', 'Japan', 'Turkey', 'United Kingdom', 'France', 'Italy', 'South Korea',
               'Republic of Korea',  # name the dataset uses for South Korea
               'Spain', 'Canada', 'Australia', 'Netherlands', 'Belgium', 'Greece', 'Portugal',
               'Sweden', 'Austria', 'Switzerland', 'Israel', 'Singapore', 'Denmark', 'Finland', 'Norway', 'Ireland',
               'New Zealand', 'Slovenia', 'Estonia', 'Cyprus', 'Luxembourg', 'Iceland']

second_world = ['Russian Federation', 'Ukraine', 'Poland', 'Uzbekistan', 'Romania', 'Kazakhstan', 'Azerbaijan', 'Czech Republic',
                'Hungary', 'Belarus', 'Tajikistan', 'Serbia', 'Bulgaria', 'Slovakia', 'Croatia', 'Moldova', 'Georgia',
                'Bosnia And Herzegovina', 'Albania', 'Armenia', 'Lithuania', 'Latvia', 'Brazil', 'Chile', 'Argentina',
                'China', 'India', 'Bolivia']

# Label each row '1', '2' or '3'; any country in neither list counts as third world.
df['country_world'] = np.select(
    [df['country'].isin(first_world), df['country'].isin(second_world)],
    ['1', '2'],
    default='3',
)
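Since the lists are typed by hand, any name spelled differently from the dataset silently ends up in the third-world group. A quick sanity check, sketched here on toy data (the names `dataset_countries`, `hand_typed` and `unmatched` are hypothetical), is to look for list entries that never occur in the data.

```python
import pandas as pd

# Toy stand-ins: a few dataset country names and a hand-typed list with one typo.
dataset_countries = pd.Series(["Ireland", "New Zealand", "Brazil", "Ireland"])
hand_typed = ["Ireland", "New Zeland"]

# Names in the list that never occur in the data would silently match nothing.
unmatched = sorted(set(hand_typed) - set(dataset_countries))
print(unmatched)
```

With the real data, the same set difference against `df['country']` lists every entry worth double-checking. Some will simply be countries the dataset does not cover.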

Exploratory analysis

I will analyze how some attributes relate to the number of suicides, starting with the year.

Year

suicides_no_year = []

for y in df['year'].unique():
    suicides_no_year.append(sum(df[df['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)
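The loop-and-sum pattern above works, but the same totals can be obtained in a single pass with `groupby`. A minimal sketch on toy data (the `toy` frame and the name `per_year` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [1985, 1985, 1986, 1986, 1987],
    "suicides_no": [10, 5, 7, 1, 20],
})

# One pass over the data: total suicides per year, largest first.
per_year = toy.groupby("year")["suicides_no"].sum().sort_values(ascending=False)
print(per_year)
```

The result is a Series indexed by year, so `per_year.index` and `per_year.values` could be passed to `sns.barplot` directly. The same pattern applies to the age, sex, country and generation totals further down.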

Age

suicides_no_age = []

for a in df['age'].unique():
    suicides_no_age.append(sum(df[df['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df['age'].unique(), y = suicides_no_age)

Sex

suicides_no_sex = []

for s in df['sex'].unique():
    suicides_no_sex.append(sum(df[df['sex'] == s]['suicides_no']))

sns.barplot(x = df['sex'].unique(), y = suicides_no_sex)

sns.catplot(x='sex', y='suicides_no',col='age', data=df, estimator=np.median,height=4, aspect=.7,kind='bar')

Country

Countries with larger populations are expected to have more suicides in absolute numbers.

suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides_no']))

n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['suicides_no_pais'][:quant]
sns.barplot(x = top_suicides, y = top_paises)

By using the number of suicides per 100k inhabitants, we remove the bias toward highly populated countries.

suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides/100k pop']))

n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais/100k'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais/100k', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais/100k', ascending=False)['suicides_no_pais/100k'][:quant]
sns.barplot(x = top_suicides, y = top_paises)
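One caveat: the code above adds up the per-row 'suicides/100k pop' values, so a country with more years (or more age groups) on record gets a larger total. An alternative, sketched here on toy data with made-up names (`toy`, `totals`, `rate_per_100k`), is to sum the suicides and the population first and compute the rate once per country. Note that summing over several years yields an average rate over the whole period, which is an assumption about what we want to compare.

```python
import pandas as pd

# Toy data: two countries, each with two demographic rows.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10, 30, 5, 5],
    "population": [100_000, 100_000, 50_000, 150_000],
})

# Aggregate counts and population first, then form the rate once per country.
totals = toy.groupby("country")[["suicides_no", "population"]].sum()
totals["rate_per_100k"] = totals["suicides_no"] / totals["population"] * 100_000
print(totals.sort_values("rate_per_100k", ascending=False))
```

In this toy example country A ends up at 20 per 100k and country B at 5 per 100k, regardless of how many rows each one has.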

Generation

suicides_no_gen = []
for g in df['generation'].unique():
    suicides_no_gen.append(sum(df[df['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df['generation'].unique(), y = suicides_no_gen)

Country world

suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides_no']))

sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides/100k pop']))

sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

GDP for year

sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df)

GDP per capita

sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df)

Attribute Correlation

plt.figure(figsize=(8,7))
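# Correlate numeric columns only; recent pandas versions raise an error on text columns.
sns.heatmap(df.select_dtypes('number').corr(), cmap = 'coolwarm', annot=True)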

Checking the suicides/100k distribution of some countries

countries = ['Russian Federation', 'Brazil', 'Poland', 'Italy', 'United States', 'Germany', 'Japan', 'Spain', 'France']
# Keep only the rows whose country is in the list above.
df_filtered = df[df['country'].isin(countries)]

plt.figure(figsize=(12,6))
sns.boxplot(x = 'suicides/100k pop', y = 'country', data = df_filtered)

General Plot of the World

# plotly.plotly was moved out of the plotly package in version 4 and is not used here.
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
cod = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')
# Country name -> country code, as listed in the plotly file.
code_by_country = dict(zip(cod['COUNTRY'], cod['CODE']))
# Names spelled differently in the two files are mapped by hand.
manual_codes = {'Bahamas': 'BHM', 'Republic of Korea': 'KOR', 'Russian Federation': 'RUS'}
# Anything still unmatched gets the placeholder code 'VC'.
codes = [code_by_country.get(c, manual_codes.get(c, 'VC')) for c in n_suicides_pais['country']]
data = dict(
        type = 'choropleth',
        locations = codes,
        z = n_suicides_pais['suicides_no_pais/100k'],
        text = n_suicides_pais['country'],
        colorbar = {'title' : 'sum of suicides/100k pop'},
      )
layout = dict(
    title = 'Suicide heat map 1985-2016',
    geo = dict(
        showframe = False,
        projection = {'type':'equirectangular'}
    )
)
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

Brazil Facts

As a Brazilian, I have a particular interest in the suicide rate in Brazil, so I’m going to look at the figures for this country specifically.

df_brasil = df[df['country'] == 'Brazil']

The country and country_world fields are constant within this subset, so they are discarded.

# Reassign instead of dropping in place: df_brasil is a slice of df.
df_brasil = df_brasil.drop(['country', 'country_world'], axis = 1)

I’m going to repeat many of the plots already made above.

suicides_no_year = []

for y in df_brasil['year'].unique():
    suicides_no_year.append(sum(df_brasil[df_brasil['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df_brasil['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)

suicides_no_age = []

# Iterate over df_brasil's own categories so the totals line up with the x labels below.
for a in df_brasil['age'].unique():
    suicides_no_age.append(sum(df_brasil[df_brasil['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df_brasil['age'].unique(), y = suicides_no_age)

suicides_no_sex = []

# Same fix as for age: take the categories from df_brasil, not df.
for s in df_brasil['sex'].unique():
    suicides_no_sex.append(sum(df_brasil[df_brasil['sex'] == s]['suicides_no']))

sns.barplot(x = df_brasil['sex'].unique(), y = suicides_no_sex)

suicides_no_gen = []
# Take the categories from df_brasil so labels and totals stay in the same order.
for g in df_brasil['generation'].unique():
    suicides_no_gen.append(sum(df_brasil[df_brasil['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df_brasil['generation'].unique(), y = suicides_no_gen)

sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df_brasil)

sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df_brasil)

plt.figure(figsize=(8,7))
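# Correlate numeric columns only; recent pandas versions raise an error on text columns.
sns.heatmap(df_brasil.select_dtypes('number').corr(), cmap = 'coolwarm', annot=True)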