Fechner sign correlation coefficient, linear correlation coefficient, and some rank correlation coefficients

In addition to the measures discussed in Section 10.2 (the correlation coefficient, the coefficient of determination, and the correlation ratio), there are other coefficients for evaluating the degree of closeness of the correlation between the phenomena under study, and the formulas for computing them are quite simple. Let us look at some of these coefficients.

Fechner sign correlation coefficient

This coefficient is the simplest indicator of the degree of closeness of a relationship; it was proposed by the German scientist G. Fechner. The indicator is based on assessing the degree of consistency between the directions of the individual deviations of the factor and resultant characteristics from their respective average values. To determine it, the average values of the resultant (ȳ) and factor (x̄) characteristics are calculated, and then the signs of the deviations from the average are found for all values of the resultant and factor characteristics. If the value being compared is greater than the average, a "+" sign is assigned; if less, a "−" sign. Matching signs for individual values of the series x and y indicate consistent variation; mismatching signs indicate a violation of consistency.

The Fechner coefficient is found using the following formula:

K_f = (C − H)/(C + H), (10.40)

where C is the number of matches between the signs of the deviations of individual values from the average value, and H is the number of mismatches between the signs of the deviations of individual values from the average value.

Note that −1 ≤ K_f ≤ 1. When K_f = ±1 we have complete direct or inverse consistency, respectively. When K_f = 0, there is no connection between the observation series.

Using the initial data of Example 10.1, let us calculate the Fechner coefficient. The data needed to determine it are given in Table 10.4.

From Table 10.4 we find that C = 6 and H = 0; therefore, by formula (10.40) we obtain K_f = (6 − 0)/(6 + 0) = 1, i.e., a complete direct dependence between weapons thefts (x) and armed crimes (y). The resulting value of K_f confirms the conclusion, made after calculating the correlation coefficient, that there is a fairly close direct linear dependence between the series x and y.

Table 10.4

Thefts of weapons, x | Armed crimes, y | Signs of deviation from the average (x; y)
773  |  4481 | − −
1130 |  9549 | − −
1138 |  8873 | − −
1336 | 12160 | + +
1352 | 18059 | + +
1396 | 19154 | + +
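The calculation in (10.40) can be sketched in Python with the data of Table 10.4 (the function name fechner_coefficient is ours, not the textbook's):

```python
# Fechner sign coefficient, formula (10.40): a minimal sketch.
def fechner_coefficient(xs, ys):
    """K_f = (C - H)/(C + H); C = sign matches, H = sign mismatches."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    c = h = 0
    for x, y in zip(xs, ys):
        # compare the signs of the deviations from the averages
        if (x - x_mean >= 0) == (y - y_mean >= 0):
            c += 1
        else:
            h += 1
    return (c - h) / (c + h)

x = [773, 1130, 1138, 1336, 1352, 1396]      # thefts of weapons
y = [4481, 9549, 8873, 12160, 18059, 19154]  # armed crimes
print(fechner_coefficient(x, y))  # 1.0: complete direct dependence
```

Running this reproduces the text's result C = 6, H = 0, K_f = 1.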

Spearman's rank correlation coefficient

This coefficient belongs to the rank coefficients: what are correlated are not the values of the factor and resultant characteristics themselves, but their ranks (the numbers of the places they occupy in each series of values arranged in ascending or descending order). Spearman's rank correlation coefficient is based on the differences between the ranks of the factor and resultant values. It is found by the following formula:

ρ = 1 − 6·Σd_i² / (n(n² − 1)), (10.41)

where d_i² is the square of the rank difference and n is the number of observations.

Let us calculate the Spearman coefficient from the data of Example 10.1. Since the values of the factor characteristic x were initially arranged in ascending order, the series x does not need to be ranked. We rank (from smallest to largest) the series y. All the data needed for the calculation are given in Table 10.5.

Table 10.5

Ranks Rg_x of series x | Ranks Rg_y of series y | |d_i| = |Rg_xi − Rg_yi|

Now, using formula (10.41), we obtain ρ = 1 − 6·2/(6·(36 − 1)) ≈ 0.943. Note that −1 ≤ ρ ≤ 1; the resulting value shows that between weapons thefts and armed crimes there is a very close direct relationship.
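Formula (10.41) can be sketched in Python for the same data (a sketch; tied ranks are not handled, and the helper names are ours):

```python
# Spearman's rank correlation, formula (10.41): a minimal sketch without
# tie correction.
def spearman_rho(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [773, 1130, 1138, 1336, 1352, 1396]
y = [4481, 9549, 8873, 12160, 18059, 19154]
print(round(spearman_rho(x, y), 3))  # 0.943
```

Only the second and third y values are out of rank order, giving Σd² = 2.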

General understanding of correlation-regression analysis

The forms and types of connections that exist between phenomena are very diverse in their classification. Subject to statistical study are only those that are quantitative in nature and are studied by quantitative methods. Let us consider the method of correlation-regression analysis, which is fundamental in the study of the relationships between phenomena.

This method has two constituent parts: correlation analysis and regression analysis. Correlation analysis is a quantitative method for determining the strength and direction of the relationship between sample variables. Regression analysis is a quantitative method for determining the type of mathematical function in the cause-and-effect relationship between variables.

To assess the strength of a connection in correlation theory, the Chaddock scale is used: weak, from 0.1 to 0.3; moderate, from 0.3 to 0.5; noticeable, from 0.5 to 0.7; high, from 0.7 to 0.9; very high (strong), from 0.9 to 1.0. It is used in the examples on this topic below.

Linear correlation

This correlation characterizes a linear relationship in the variation of variables. It can be paired (two correlated variables) or multiple (more than two variables), and direct or inverse (positive or negative), depending on whether the variables vary in the same or in opposite directions.

If the variables are quantitative and equivalent in their independent observations, then the most important empirical measures of the closeness of their linear relationship are the sign correlation coefficient of the German psychologist G. T. Fechner (1801-1887) and the paired, pure (partial), and multiple (cumulative) correlation coefficients of the English statistician-biometrician K. Pearson (1857-1936).

Fechner's sign pair correlation coefficient measures the consistency of the directions of the individual deviations of the variables from their averages x̄ and ȳ. It is equal to the ratio of the difference between the sums of matching (ΣC_k) and mismatching (ΣN_k) pairs of signs of the deviations (x_i − x̄) and (y_i − ȳ) to the sum of these sums:

K_f = (ΣC_k − ΣN_k)/(ΣC_k + ΣN_k). (1)

The magnitude K_f varies from −1 to +1. The summation in (1) is taken over all observations; the indices are omitted from the sums for simplicity. If either deviation (x_i − x̄) or (y_i − ȳ) is zero, the observation is not included in the calculation. If both deviations are zero at once, the case is counted as having matching signs and is included in ΣC_k. Table 12.1 shows the preparation of the data for calculation (1).

Table 12.1 Data for calculating the Fechner coefficient.

Number of employees, thousand people | Trade turnover, c.u. | Deviations from the averages | Sign comparison: coincidence (C_k) / mismatch (N_k)

By (1) we have K_f = (3 − 2)/(3 + 2) = 0.20. The direction of the relationship between the variations of the average number of employees and trade turnover is positive (direct): the signs of the deviations coincide in the majority of cases (3 out of 5). The closeness of the relationship between the variables on the Chaddock scale is weak.

Pearson's paired, pure (partial), and multiple (cumulative) linear correlation coefficients, in contrast to the Fechner coefficient, take into account not only the signs but also the magnitudes of the deviations of the variables. Different methods are used to calculate them. By the direct-counting method for ungrouped data, the Pearson pair correlation coefficient has the form:

r_xy = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² · Σ(y_i − ȳ)²). (2)

This coefficient also varies from −1 to +1. If there are several variables, the Pearson multiple (cumulative) linear correlation coefficient is calculated. For three variables x, y, z it looks like

R = √((r_xy² + r_yz² − 2·r_xy·r_yz·r_xz) / (1 − r_xz²)). (3)

This coefficient varies from 0 to 1. If we eliminate (completely exclude or fix at a constant level) the influence of z on x and y, then their "general" relationship turns into a "pure" one, giving the pure (partial) Pearson linear correlation coefficient:

r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). (4)

This coefficient varies from −1 to +1. The squares of the correlation coefficients (2)-(4) are called coefficients (indices) of determination: pair, pure (partial), and multiple (total), respectively:

d_xy = r_xy²; d_xy.z = r_xy.z²; D = R². (5)

Each coefficient of determination varies from 0 to 1 and evaluates the degree of variational determinacy in the linear relationship of the variables, showing the proportion of the variation in one variable (y) that is due to the variation of the other variable(s), x and z. The multivariate case of more than three variables is not considered here.
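Formulas (2)-(5) can be sketched in Python. The figures of Table 12.2 are not reproduced in this extract, so the data below are our own illustrative numbers, and all function names are ours:

```python
# A sketch of the pair (2), multiple (3), and partial (4) Pearson
# coefficients and the determination coefficients (5).
from math import sqrt

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

def partial_r(rxy, rxz, ryz):
    # formula (4): relationship of x and y with the influence of z eliminated
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def multiple_R(rxy, rxz, ryz):
    # formula (3): joint linear effect of x and z on y
    return sqrt((rxy ** 2 + ryz ** 2 - 2 * rxy * ryz * rxz) / (1 - rxz ** 2))

x = [2, 3, 4, 5, 6]        # hypothetical number of employees
y = [20, 24, 23, 30, 33]   # hypothetical trade turnover
z = [1, 2, 2, 3, 4]        # hypothetical store area
rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
print(rxy ** 2)                    # pair determination d_xy, formula (5)
print(partial_r(rxy, rxz, ryz))    # pure coefficient r_xy.z
print(multiple_R(rxy, rxz, ryz))   # multiple coefficient R
```

The squares of these three values give the pair, partial, and multiple determination coefficients of (5).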

Following the work of the English statistician R. A. Fisher (1890-1962), the statistical significance of the paired and pure (partial) Pearson correlation coefficients is checked, when their distribution is normal, using the t-distribution of the English statistician W. S. Gosset (pseudonym "Student"; 1876-1937) with a given significance level α and the available degrees of freedom k = n − m − 1, where m is the number of connections (factor variables). For the pair coefficient we have its root mean square error and the actual value of Student's t-test:

m_r = √((1 − r²)/(n − 2)),  t_r = r/m_r = r·√(n − 2)/√(1 − r²). (6)

For the pure correlation coefficient, (n − 3) must be taken instead of (n − 2) when calculating it, because in this case m = 2 (two factor variables, x and z). For large n > 100, n can be taken instead of (n − 2) or (n − 3) in (6), at a slight cost in accuracy.

If t_r > t_tab, then the correlation coefficient (pair or pure) is statistically significant; when t_r ≤ t_tab it is insignificant.

The significance of the multiple correlation coefficient R is checked by the Fisher F-test, calculating its actual value:

F_R = (R²/m) / ((1 − R²)/(n − m − 1)). (7)

When F_R > F_tab the coefficient R is considered significant at the given significance level α and the available degrees of freedom k₁ = m and k₂ = n − m − 1; when F_R ≤ F_tab it is insignificant.
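The actual test values in (6) and (7) are simple to compute; a sketch (the table values t_tab and F_tab must still be looked up in statistical tables):

```python
# Actual values of Student's t-test (6) and Fisher's F-test (7): a sketch.
from math import sqrt

def t_actual(r, n, m=1):
    """t for a pair coefficient (m=1, df=n-2) or a pure one (m=2, df=n-3)."""
    df = n - m - 1
    return abs(r) * sqrt(df) / sqrt(1 - r ** 2)

def f_actual(R, n, m=2):
    """F for a multiple coefficient R with m factor variables."""
    return (R ** 2 / m) / ((1 - R ** 2) / (n - m - 1))

# with the chapter's figures: n = 5 stores, r_xy about 0.60, R = 0.844
print(t_actual(0.595, 5))   # compare with t_tab1 = 3.182
print(f_actual(0.844, 5))   # compare with F_tab = 19.0
```

Both actual values fall below their table values, matching the chapter's conclusion that all the coefficients are insignificant at n = 5.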

In large-volume populations n > 100, the normal distribution law (tabulated Laplace-Sheppard function) is used directly to assess the significance of all Pearson coefficients instead of the t and F tests.

Finally, if the Pearson coefficients do not obey the normal law, Fisher's Z-test is used as the criterion of their significance; it is not considered here.

A conditional example of calculating (2)-(7) is given in Table 12.2, where the initial data of Table 12.1 are supplemented with a third variable z, the total floor area of the store (in hundreds of sq. m).

Table 12.2. Preparing data to calculate the Pearson correlation coefficients

Indicators

According to (2) - (5), the Pearson linear correlation coefficients are equal to:

The relationship between the variables x and y is positive but not close: it amounts to r_xy ≈ 0.60 by their pair correlation coefficient and to r_xy.z ≈ 0.06 by the pure correlation coefficient, assessed on the Chaddock scale as "noticeable" and "weak", respectively.

The determination coefficients d_xy = 0.354 and d_xy.z = 0.0037 indicate that the variation of y (trade turnover) is due to the linear variation of x (number of employees) by 35.4% in their general interrelation, and by only 0.37% in the pure interrelation. This situation is due to the significant impact on x and y of the third variable z, the total area occupied by the stores. The closeness of its relationship with them is r_xz = 0.677 and r_yz = 0.844, respectively.

The multiple (cumulative) correlation coefficient of the three variables shows that the closeness of the linear relationship of x and z with y amounts to R = 0.844, assessed on the Chaddock scale as "high"; the multiple determination coefficient is D = 0.713, indicating that 71.3% of the whole variation of y (trade turnover) is determined by the cumulative impact of the variables x and z. The remaining 28.7% is due to the impact on y of other factors or to a curvilinear relationship of the variables y, x, z.

To assess the significance of the correlation coefficients, we take the significance level α = 0.05. From the initial data we have the degrees of freedom n − 2 = 3 for the pair coefficient and n − 3 = 2 for the pure one. From the theoretical table we find t_tab1 = 3.182 and t_tab2 = 4.303, respectively. For the F-test we have k₁ = 2 and k₂ = 2, and from the table we find F_tab = 19.0. The actual values of each criterion according to (6) and (7) are:

All calculated criteria are less than their table values: all Pearson correlation coefficients are statistically insignificant.


The correlation coefficient proposed in the second half of the 19th century by G. T. Fechner is the simplest measure of the relationship between two variables. It is based on comparing two psychological characteristics, x_i and y_i, measured on the same sample, by comparing the signs of the deviations of the individual values from the averages: (x_i − x̄) and (y_i − ȳ). The conclusion about the correlation between the two variables is made by counting the numbers of matches and mismatches of these signs.

Example

Let x_i and y_i be two traits measured on the same sample of subjects. To calculate the Fechner coefficient, it is necessary to find the average value of each characteristic, as well as the sign of the deviation from the average for each value of the variables (Table 8.1):

Table 8.1

x_i | y_i | Designation

In the table: a denotes a coincidence of signs and b a mismatch of signs; n_a is the number of matches and n_b the number of mismatches (here n_a = 4, n_b = 6).

The Fechner correlation coefficient is calculated by the formula:

K_f = (n_a − n_b)/(n_a + n_b). (8.1)

In this case: K_f = (4 − 6)/(4 + 6) = −0.2.

Conclusion

There is a weak negative relationship between the studied variables.

It should be noted that the Fechner correlation coefficient is not a sufficiently strict criterion, so it can only be used at the initial stage of data processing and to formulate preliminary conclusions.

8.4. Pearson correlation coefficient

The underlying principle of the Pearson correlation coefficient is the use of products of moments (deviations of a variable's values from its average):

If the sum of the products of moments is large and positive, then x and y are directly related; if the sum is large and negative, then x and y are strongly inversely related; finally, if there is no connection between x and y, the sum of the products of moments is close to zero.

So that the statistic does not depend on the sample size, the average of the products of moments is taken rather than their sum. However, the division is made not by the sample size but by the number of degrees of freedom, n − 1.

The magnitude S_xy = Σ(x_i − x̄)(y_i − ȳ)/(n − 1) is a measure of the connection between x and y and is called the covariance of x and y.

In many problems in the natural and technical sciences, covariance is a completely satisfactory measure of connection. Its disadvantage is that the range of its values ​​is not fixed, i.e. it can vary within indefinite limits.

To standardize the measure of association, the covariance must be freed from the influence of the standard deviations. To do this, S_xy is divided by s_x and s_y:

r_xy = S_xy/(s_x·s_y), (8.3)

where r_xy is the correlation coefficient, or Pearson product-moment coefficient.
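The path from covariance to r_xy in (8.3) can be sketched in Python (function names are ours; sample statistics use the n − 1 divisor, as in the text):

```python
# Covariance and its standardization into the Pearson coefficient (8.3).
from math import sqrt

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # average of the products of moments over n - 1 degrees of freedom
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def std(vs):
    n = len(vs)
    m = sum(vs) / n
    return sqrt(sum((v - m) ** 2 for v in vs) / (n - 1))

def r_xy(xs, ys):
    return covariance(xs, ys) / (std(xs) * std(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(covariance(xs, ys))  # 2.0 -- unbounded measure of connection
print(r_xy(xs, ys))        # 0.8 -- standardized, always within [-1, 1]
```

Dividing by s_x·s_y removes the units, which is exactly the drawback of the raw covariance noted above.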

The general formula for calculating the correlation coefficient (after some transformations) is as follows:

r_xy = (n·Σx_i y_i − Σx_i·Σy_i) / √((n·Σx_i² − (Σx_i)²)(n·Σy_i² − (Σy_i)²)). (8.4)

The impact of data transformations on r_xy:

1. Linear transformations of x and y of the form bx + a and dy + c with b, d > 0 do not change the magnitude of the correlation between x and y.

2. Linear transformations of x and y with b < 0 and d > 0, and also with b > 0 and d < 0, change the sign of the correlation coefficient without changing its magnitude.
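Both properties are easy to verify numerically; a sketch (r is computed by the standard product-moment formula, and the data are our own):

```python
# Demonstration of properties 1-2: a positive linear transformation leaves
# r unchanged; a negative slope flips only its sign.
from math import sqrt

def r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
base = r(xs, ys)
scaled = r([3 * x + 7 for x in xs], [0.5 * y - 1 for y in ys])    # b, d > 0
flipped = r([-2 * x + 1 for x in xs], [0.5 * y - 1 for y in ys])  # b < 0, d > 0
print(base, scaled, flipped)  # scaled equals base; flipped equals -base
```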

The reliability (or, otherwise, statistical significance) of the Pearson correlation coefficient can be determined in different ways:

From the tables of critical values of the Pearson and Spearman correlation coefficients (see Appendix, Table XIII). If the value of r_xy obtained in the calculations exceeds the critical (table) value for the given sample, the Pearson coefficient is considered statistically significant. The number of degrees of freedom here is n − 2, where n is the number of pairs of compared values (the sample size).

According to Table XV of the Appendix, which is entitled “The number of pairs of values ​​required for the statistical significance of the correlation coefficient.” In this case, it is necessary to focus on the correlation coefficient obtained in the calculations. It is considered statistically significant if the sample size is equal to or greater than the tabulated number of pairs of values ​​for a given coefficient.

Using Student's t-test, calculated as the ratio of the correlation coefficient to its error:

t = r/m_r, (8.5)

The error of the correlation coefficient is calculated by the formula:

m_r = √((1 − r²)/(n − 2)),

where m_r is the error of the correlation coefficient, r is the correlation coefficient, and n is the number of pairs being compared.
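A sketch of (8.5), using the coefficient r = 0.58 and n = 22 from the worked example below:

```python
# Significance of r by Student's test, formula (8.5): a minimal sketch.
from math import sqrt

def m_r(r, n):
    """Error of the correlation coefficient."""
    return sqrt((1 - r ** 2) / (n - 2))

def t_value(r, n):
    return r / m_r(r, n)

# r = 0.58, n = 22 pairs, so nu = n - 2 = 20 degrees of freedom
print(t_value(0.58, 22))  # compare with t_cr = 2.09; 2.85; 3.85
```

The resulting t lies between the 2nd and 3rd critical values, in line with the example's conclusion of significance at the 1st and 2nd levels only.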

Let us consider the procedure of calculations and determination of the statistical significance of the Pearson correlation coefficient using the example of solving the following problem.

The task

22 high school students were tested on two tests: USK (level of subjective control) and MkU (motivation for success). The following results were obtained (Table 8.2):

Table 8.2

USK (x_i) | MkU (y_i) | USK (x_i) | MkU (y_i)

Exercise

Test the hypothesis that people with a high level of internality (USK score) are characterized by a high level of motivation for success.

Solution

1. We use the Pearson correlation coefficient in the following modification (see formula 8.4):

For the convenience of data processing on a microcalculator (in the absence of the necessary computer program), it is recommended to create an intermediate work table of the following form (Table 8.3):

Table 8.3

x_i·y_i
x_1·y_1
x_2·y_2
x_3·y_3
…
x_n·y_n
Σ x_i·y_i

2. We carry out calculations and substitute the values ​​into the formula:

3. We determine the statistical significance of the Pearson correlation coefficient in three ways:

1st method:

In Table XIII of the Appendix we find the critical values of the coefficient for the 1st and 2nd significance levels: r_cr = 0.42; 0.54 (ν = n − 2 = 20).

We conclude that r_xy > r_cr, i.e., the correlation is statistically significant at both levels.

2nd method:

Let us use Table XV, in which we determine the number of pairs of values (number of subjects) sufficient for the statistical significance of a Pearson correlation coefficient equal to 0.58: for the 1st, 2nd, and 3rd significance levels it is 12, 18, and 28, respectively.

From this we conclude that the correlation coefficient is significant at the 1st and 2nd levels but "does not reach" the 3rd level of significance.

3rd method:

We calculate the error of the correlation coefficient and the Student coefficient as the ratio of the Pearson coefficient to the error:

In Table X we find the standard values of Student's t for the 1st, 2nd, and 3rd significance levels with the number of degrees of freedom ν = n − 2 = 20: t_cr = 2.09; 2.85; 3.85.

General conclusion

The correlation between the indicators of the USC and MkU tests is statistically significant for the 1st and 2nd levels of significance.

Note:

When interpreting the Pearson correlation coefficient, the following points must be considered:

    The Pearson coefficient can be used for various scales (ratio, interval, or ordinal) with the exception of the dichotomous scale.

    A correlation does not always mean a cause-and-effect relationship. In other words, if we found, say, a positive correlation between height and weight in a group of subjects, this does not mean that height depends on weight or vice versa (both of these characteristics depend on a third (external) variable, which in this case is associated with genetic constitutional characteristics of a person).

r_xy ≈ 0 can be observed not only in the absence of a connection between x and y, but also in the case of a strong nonlinear connection (Fig. 8.2a). In this case the negative and positive correlations balance out, creating the illusion of no connection.

r_xy may be quite small if a strong connection between x and y is observed in a narrower range of values than the one studied (Fig. 8.2b).

    Combining samples with different means can create the illusion of a fairly high correlation (Fig. 8.2 c).

Fig. 8.2. Possible sources of errors in interpreting the value of the correlation coefficient (explanations in the text, points 3-5 of the note)

Conclusions:

The resulting value of the sign correlation coefficient is zero, since the number of sign matches equals the number of sign mismatches. This is the main drawback of this indicator. Based on it, one might assume that there is no relationship.

Linear correlation coefficient

Checking the significance of the correlation coefficient:

Conclusions:

The obtained value of the linear correlation coefficient indicates that the relationship between the share of burned fuels in total supply and life expectancy at birth is moderate and inverse.

Therefore, with a probability of 95% we can assume that the correlation is still significant.

Empirical correlation ratio:

Checking the significance of an empirical correlation relationship:

Conclusions:

The obtained value of the empirical correlation ratio indicates a moderate relationship between the characteristics under study.

Therefore, with a probability of 95% we can conclude that the correlation between the analyzed indicators is insignificant.
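The empirical correlation ratio η used above can be sketched in Python: η² is the share of the between-group variance in the total variance of y. The data and function name below are our own illustration, not the report's figures:

```python
# Empirical correlation ratio eta for grouped data: a minimal sketch.
from math import sqrt

def correlation_ratio(groups):
    """groups: list of lists of y values, one list per group of x."""
    all_y = [y for g in groups for y in g]
    grand = sum(all_y) / len(all_y)
    total = sum((y - grand) ** 2 for y in all_y)          # total variation
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2  # between-group
                  for g in groups)
    return sqrt(between / total)

# hypothetical groups of life expectancy by fuel-share intervals
groups = [[55, 60, 58], [63, 66], [70, 72, 71]]
print(correlation_ratio(groups))  # always between 0 and 1
```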

Spearman's rank correlation coefficient:

Conclusions:

Based on the results of calculating the Spearman coefficient, it can be assumed that there is a weak inverse relationship between the share in the total supply of burned fuels and life expectancy at birth.

Kendall rank correlation coefficient:

Conclusions:

Based on the calculated rank correlation coefficient, we can assume that there is a weak inverse relationship between the characteristics under study.
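Kendall's coefficient can be sketched by counting concordant and discordant pairs (the tau-a form without tie correction; data and names are our own illustration):

```python
# Kendall's rank correlation (tau-a, no tie correction): a sketch.
def kendall_tau(xs, ys):
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in x and y
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(kendall_tau(xs, ys))  # 0.6
```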

· Testing the possibility of using a linear function as a form of relationship

It is considered possible to use a linear correlation equation, but to test the hypothesis of a linear dependence it is more effective to use the quantity .

Conclusions:

Therefore, the hypothesis about the linearity of the relationship between the share in the total supply of burned fuels and life expectancy at birth is correct.



Countries with an average level of human development

· Identification of the existence of a relationship between a factor and a resultant characteristic

Analytical grouping

Empirical regression line


Conclusions:

Comparing the average values of the resultant characteristic across groups, one can see the following trend: the higher the share of burned fuels in total supply, the longer the life expectancy at birth (if we disregard jumps possibly due to other factors); that is, we can assume the presence of a direct correlation between the characteristics.

Correlation field


Conclusions:

The main body of units forms a cloud running mainly from the lower left corner of the coordinate system to the upper right corner, so it can be assumed that there is a direct relationship between the characteristics.

Correlation table

When grouping by the factor characteristic, the number of groups is 6. When grouping by the resultant characteristic, we set the number of groups equal to the number of groups by the factor characteristic. We also exclude countries for which there are no data on the factor characteristic; the number of countries is thus reduced to thirty.

Now we create a correlation table:

Correlation table: share of burned fuels in total supply, % (rows: 15-30, 30-45, 45-60, 60-75, 75-90, 90-100, Total) by average life expectancy at birth, years (columns: 52.0-57.2, 57.2-62.4, 62.4-67.6, 67.6-70.1, 70.1-72.6, 72.6-75.1, Total)

Conclusions:

Although the direction of the correlation is difficult to determine precisely, the frequencies in the correlation table lie mainly on the diagonal from the upper left corner to the lower right corner, i.e., large values of the factor characteristic correspond to large values of the resultant one; therefore, we can assume the presence of a direct correlation between the characteristics.

· Indicators for assessing the degree of closeness of the relationship