18 LAB VIII: Inference for categorical data

When we have finished this Lab, we should be able to:

Learning objectives

Applying hypothesis testing for chi-square test of independence
Investigate the association between two categorical variables
Interpret the results

18.1 Introduction

If we want to look at the association between two categorical variables then we can’t use the mean or any similar statistic because we don’t have any variables that have been measured continuously. Therefore, when we have measured categorical variables, we analyze frequencies. That is, we analyze the number of subjects that fall into each combination of categories. We can tabulate these frequencies and this is known as a cross-tabulation table or a two-way table or a contingency table.

18.2 Chi-square test of independence

If we want to see whether there’s an association between two categorical variables we can use the chi-square test, often called chi-square test of independence. This is an extremely elegant statistic based on the simple idea of comparing the frequencies we observe in certain categories to the frequencies we might expect to get in those categories by chance.

Opening the file

We will use the “Survival from Malignant Melanoma” dataset named “meldata”. The data consist of measurements made on patients with malignant melanoma, a type of skin cancer. Each patient had their tumor removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark, between 1962 and 1977.

Open the dataset named “meldata” from the file tab in the menu:

The dataset “meldata” includes 205 patients and has seven variables. We are interested in association between the binary variables ulcer (Absent/Present) and satus (Alive/Died)(Figure 18.1).

Research question

We are interested in the association between tumor ulceration and death from melanoma.

Hypothesis Testsing for the Pearson’s Chi-square test of independence

Null hypothesis and alternative hypothesis

\(H_0\): There is no association between the two categorical variables (they are independent). In other words, there is no difference in the proportion of patients with ulcerated tumors who die compared with non-ulcerated tumors (\(p_{ulcerated} = p_{non-ulcertaed}\)). Also RR=1, OR=1.
\(H_1\): There is association between the two categorical variables (they are dependent). In other words, there is difference in the proportion of patients with ulcerated tumors who die compared with non-ulcerated tumors (\(p_{ulcerated} \neq p_{non-ulcertaed}\)). Also RR≠1, OR≠1.

Assumption

Check if the following assumption is satisfied

For 2x2 tables: All expected counts should be ≥ 5 in each cell of the 2x2 table.

For larger tables: All expected counts should be > 1 and no more than 20% of all cells should have expected counts < 5.

Explore the data with plots and tables

On the Jamovi top menu navigate to

flowchart LR
  A(Analyses) -.-> B(Frequencies) -.-> C(Contingency Tables \nIndependent Samples \nx<sup>2 </sup> test of association)

as shown below in ?fig-chi0.

In the menu at the top, choose Analyses > Frequencies > x² test of association.

The Contingency Tables dialogue box opens. Drag the variable ulcer into the Rows field and the variable status into the Columns field, as shown below (Figure 18.2):

Figure 18.2: Contingency Tables dialogue box. Drag the variable **ulcer** into the Rows and the variable **status** into the Columns.

A. Stacked bar plot

First, it is useful to plot the data as counts. In the Plot section, check the Bar Plot and Stacked bar type, as shown below (Figure 18.3):

Figure 18.3: Check the Bar Plot and Stacked bar type in the Plots section.

A bar plot is generated in the output on our right-hand side, as shown below (Figure 18.4):

Figure 18.4: Stacked bar plot with counts (absolute frequencies).

In practice, it is percentages we are comparing. Let’s create a stacked bar plot with percentages. In the Plot section, now, check the Percentages and within rows, as shown below (Figure 18.5):

Figure 18.5: Select the Y-axis to present within row percentages.

Figure 18.6: Stacked bar plot with row percentages.

Just from the plot in Figure 18.6, death from melanoma in the ulcerated tumor (Present) group is around 45% and in the non-ulcerated (Absent) group around 15%. The number of patients included in the study is not huge, however, this still looks like a real difference given its effect size.

B. Contingency Table

We will create a contingency 2x2 table with row percentages and the expected counts for the binary variables ulcer (Absent/Present) and satus (Alive/Died). In the Cells section, tick the Row and Expected counts boxes, as shown below (Figure 18.7):

Figure 18.7: Select row percentages and expected counts.

We present the generated contingency table in Figure 18.8:

Figure 18.8: Contingency table of **ulcer** and **status**.

From the raw frequencies, there seems to be a large difference, as we noted in Figure 18.6. The proportion of patients with ulcerated tumors who die equals to 46% compared with non-ulcerated tumors 14%.

Additionally, as we observe in Figure 18.8 the assumption of chi-square test is fulfilled (all the expected counts are greater than 5; green cells).

Chi-square test

Finally, we see the results of the chi-square test.

Since p<0.001, there is evidence for an association between the ulcer and status (reject \(H_0\)).

Interpretation

There is evidence for an association between the ulcer and status (reject \(H_0\)). The proportion of patients with ulcerated tumors who died (46%) is significant larger compared with non-ulcerated tumors (14%) (p-value <0.001).

18.3 Risk ratio (RR) and Odds ratio (OR)

First we need to reshape the contingency table. Double click on the variable name ulcer and change the order of levels using the (up/down) arrows as follows (Figure 18.10):

Figure 18.10: Reorder the levels of the variable ulcer.

Similarly, change the order of levels for the variable status (Figure 18.11):

Figure 18.11: Reorder the levels of the variable status.

The reshaped table follows:

Figure 18.12: The reshaped contingency 2x2 table

In the Statistics section, check the Relative risk, Odds ratio and Condidence intervals boxes, as follows (Figure 18.13):

Figure 18.13: Select the relative risk, odds ratio and 95% condidence intervals.

Risk ratio (or relative risk)

First, we calculate the risk ratio (RR) of the 2x2 table in Figure 18.12 by hand:

\[ Risk \ Ratio = \frac{\frac{41}{90}}{\frac{16}{115}} =\frac{0.4556}{0.1391} = 3.27\] Using Jamovi we get the following output (Figure 18.14):

Figure 18.14: The risk ratio (relative risk) with 95% condidence intervals.

The risk of dying is 3.27 times higher for patients with ulcerated tumors compared to non-ulcerated tumors. Note that the 95% confidence interval of the RR (1.97, 5.44) does not include the hypothesized null value of 1.

Odds Ratio

Next, let’s calculate the odds ratio of the 2x2 table in Figure 18.12 by hand:

\[ Odds \ Ratio = \frac{\frac{41}{49}}{\frac{16}{99}} =\frac{0.837}{0.162} = 5.18\]

Using Jamovi we get the following output (Figure 18.15):

Figure 18.15: The odds ratio with 95% condidence intervals.

The odds of dying is 5.18 times higher for patients with ulcerated tumors compared to non-ulcerated tumors patients. Note that the 95% confidence interval of the OR (2.65, 10.13) does not include the hypothesized null value of 1.