Statistics: The Art and Science of Learning from Data.

An Overview of Statistics

*What You Should Learn*

The definition of statistics

How to distinguish between a population and a sample and between a parameter and a statistic

How to distinguish between descriptive statistics and inferential statistics

Statistics is the science of collecting, organizing, analyzing and interpreting data in order to make decisions.

- Data consist of information coming from observations, counts, measurements, or responses. The singular of data is datum.
- Subjects are the objects or individuals that we measure in a study. Subjects can be people or any other objects of interest, e.g., schools, cities, countries, days, months, etc.
- Population is the collection of all outcomes, responses, measurements or counts that are of interest.
- Sample is a subset of a population.

*Example 1*

Identifying Data Sets

In a recent survey, 3002 adults in the United States were asked if they read news on the Internet at least once a week. Six hundred of the adults said yes. Identify the population and the sample. Describe the data set. *(Source: Pew Research Center)*

*Solution*

- The population consists of the responses of all adults in the United States.
- The sample consists of the responses of the 3002 adults in the United States in the survey. The sample is a subset of the responses of all adults in the United States. The data set consists of 600 yes’s and 2402 no’s.
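
As a small illustration, the data set from Example 1 can be written out and summarized in a few lines of Python. This is only a sketch: the counts come from the example, while the variable names are illustrative assumptions.

```python
# Responses from Example 1: 600 "yes" and 2402 "no" (3002 adults in total).
# The variable names are illustrative, not part of the survey.
responses = ["yes"] * 600 + ["no"] * 2402

sample_size = len(responses)
yes_count = responses.count("yes")

# Proportion of the sample that answered "yes".
yes_proportion = yes_count / sample_size

print(f"Sample size: {sample_size}")
print(f"Share answering yes: {yes_proportion:.1%}")
```

The proportion computed here describes only the 3002 adults in the sample, not all adults in the United States.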

*Practice Problem 1*

The U.S. Department of Energy conducts weekly surveys of approximately 900 gasoline stations to determine the average price per gallon of regular gasoline. On December 29, 2003, the average price was $1.478 per gallon. Identify the population and the sample. *(Source: U.S. Department of Energy)*

**Note:** Whether a data set is a population or a sample usually depends on the context of the real-life situation.

- A parameter is a numerical description of a population characteristic.
- Population standard deviation: \(\sigma\).
- Population variance: \(\sigma^2\).
- Population mean: \(\mu\).

- A statistic is a numerical description of a sample characteristic.
- Sample mean: \(\bar{x}\).
- Sample standard deviation: \(s\).
- Sample variance: \(s^2\).
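
The distinction between parameters and statistics can be sketched with Python's standard `statistics` module. The data values below are hypothetical and chosen only for illustration. The functions prefixed with `p` treat the data as a whole population, while the others treat it as a sample.

```python
import statistics

# Hypothetical population of six measurements (illustrative values only).
population = [4, 8, 6, 5, 3, 10]
# Hypothetical sample: a subset of the population above.
sample = [4, 6, 10]

# Parameters describe the whole population.
mu = statistics.mean(population)             # population mean
sigma_sq = statistics.pvariance(population)  # population variance (divides by N)
sigma = statistics.pstdev(population)        # population standard deviation

# Statistics describe the sample.
x_bar = statistics.mean(sample)              # sample mean
s_sq = statistics.variance(sample)           # sample variance (divides by n - 1)
s = statistics.stdev(sample)                 # sample standard deviation

print(f"Parameters: mu={mu}, sigma^2={sigma_sq:.3f}, sigma={sigma:.3f}")
print(f"Statistics: x_bar={x_bar:.3f}, s^2={s_sq:.3f}, s={s:.3f}")
```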

*Example 2*

Distinguishing between a Parameter and a Statistic

Decide whether the numerical value describes a population parameter or a sample statistic. Explain your reasoning.

A recent survey of a sample of MBAs reported that the average starting salary for an MBA is less than $65,000. *(Source: The Washington Post Company)*

Starting salaries for the 667 MBA graduates from the University of Chicago Graduate School of Business increased by 8.5% from the previous year.

In a random check of a sample of retail stores, the Food and Drug Administration found that 34% of the stores were not storing fish at the proper temperature.

*Solution*

Because the average of $65,000 is based on a subset of the population, it is a sample statistic.

Because the percent increase of 8.5% is based on all 667 graduates’ starting salaries, it is a population parameter.

Because the percent of 34% is based on a subset of the population, it is a sample statistic.

The study of statistics has two major branches: *descriptive statistics* and *inferential statistics*.

**Descriptive statistics** is the branch of statistics that deals with the collection, organization, summarization, and display of data.

**Inferential statistics** is the branch of statistics that deals with the use of a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

*Example 3*

Descriptive and Inferential Statistics

Decide which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?

A large sample of men, aged 48, was studied for 18 years. For unmarried men, approximately 70% were alive at age 65. For married men, 90% were alive at age 65. *(Source: The Journal of Family Issues)*

In a sample of Wall Street analysts, the percentage who incorrectly forecasted high-tech earnings in a recent year was 44%. *(Source: Bloomberg News)*

*Solution*

Descriptive statistics involves statements such as “For unmarried men, approximately 70% were alive at age 65” and “For married men, 90% were alive at age 65.” A possible inference drawn from the study is that being married is associated with a longer life for men.

The part of this study that represents the descriptive branch of statistics involves the statement “the percentage of Wall Street analysts who incorrectly forecasted high-tech earnings in a recent year was 44%.” A possible inference drawn from the study is that the stock market is difficult to forecast, even for professionals.

Data Classification

*What You Should Learn*

How to distinguish between quantitative data and qualitative data

How to classify data with respect to the four levels of measurement: nominal, ordinal, interval, and ratio

There are two major types of data: **quantitative data** and **qualitative data**.

**Quantitative data** consist of numerical measurements or counts. Examples: height, weight, speed of a car, number of houses, etc.

**Qualitative data** consist of attributes, labels, or nonnumerical entries. Examples: gender, race, country of origin, etc.

Data at the **nominal level of measurement** are qualitative only. Data at this level are categorized by using names, labels, or qualities. E.g. Zip codes, names of network affiliates, etc. No mathematical computations can be made at this level.

Data at the **ordinal level of measurement** are qualitative or quantitative. Data at this level can be arranged in order, but differences between data entries are not meaningful. E.g. Grammy Awards, letter grades, movie ranking, etc.

Data at the **interval level of measurement** are quantitative. The data can be ordered, and you can calculate meaningful differences between data entries. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero. E.g. temperature, State Gov’t Tax collections by year, etc.

Data at the **ratio level of measurement** are similar to data at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data values can be formed so one data value can be expressed as a multiple of another. E.g. home prices, volumes, fish lengths, etc.
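
One way to see the difference between the interval and ratio levels is to check whether a ratio of two values survives a change of units. The sketch below uses hypothetical temperatures and fish lengths. The helper function names are illustrative, and the conversion formulas are the standard ones.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def centimeters_to_inches(centimeters: float) -> float:
    """Convert a length from centimeters to inches."""
    return centimeters / 2.54


# Interval level: 20 degrees C is not "twice as hot" as 10 degrees C, because
# the ratio changes when the same temperatures are expressed in Fahrenheit.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio level: a 40 cm fish is twice as long as a 20 cm fish in any unit,
# because a length of zero is an inherent zero.
ratio_cm = 40 / 20
ratio_inches = centimeters_to_inches(40) / centimeters_to_inches(20)

print(f"Temperature ratio: {ratio_celsius:.2f} (C) vs {ratio_fahrenheit:.2f} (F)")
print(f"Length ratio: {ratio_cm:.2f} (cm) vs {ratio_inches:.2f} (in)")
```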

Experimental Design

*What You Should Learn*

How to design a statistical study

How to collect data by doing an observational study, performing an experiment, using a simulation, or using a survey

How to create a sample using random sampling, simple random sampling, stratified sampling, cluster sampling, and systematic sampling and how to identify a biased sample

The purpose of any statistical study is to use sample information to make data-informed decisions about a general population of interest.

- Identify the variable(s) of interest (the focus) and the population of the study.
- Develop a detailed plan for collecting data. If you use a sample, make sure the sample is representative of the population.
- Collect data.
- Describe the data using descriptive statistics techniques.
- Interpret the data and make decisions about the population using inferential statistics.
- Identify any possible errors.

Research data can be collected in several ways depending on the focus of one’s study. Below are four methods of data collection:

*Do an observational study.* In an observational study, a researcher observes and measures characteristics of interest of part of a population.

*Perform an experiment.* When performing an experiment, a treatment is applied to part of a population (the *treatment group*) and responses are observed and recorded.

A major distinction between an observational study and an experiment is that in an observational study the researcher does not manipulate the subjects, whereas in an experiment the researcher applies a treatment to one group of subjects (the treatment group) and a placebo to another group (the control group). An experiment in which neither the researcher nor the subjects know which subjects are receiving a placebo is called a *double-blind* experiment. An experiment in which the researcher knows which subjects are receiving the placebo (control group) and which are receiving the treatment (treatment group), but the subjects do not, is called a *single-blind* experiment.
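
The text does not say how subjects end up in the treatment or control group. A common approach, assumed here purely for illustration, is to assign them at random. The function name `assign_groups` and the subject labels are hypothetical.

```python
import random


def assign_groups(subjects, seed=0):
    """Randomly split subjects into a treatment group and a control group.

    Random assignment is an assumption of this sketch, not a rule
    stated in the text above.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(subjects)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject labels.
subjects = [f"subject_{i}" for i in range(1, 11)]
treatment_group, control_group = assign_groups(subjects)

print("Treatment group:", treatment_group)
print("Control group:", control_group)
```

In a double-blind setup, the mapping from subject to group would be kept from both the subjects and the researcher who records the responses.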

*Use a simulation.* A *simulation* is the use of a mathematical or physical model to reproduce the conditions of a situation or process. Simulations enable us to study situations that are impractical or even dangerous to create in real life, and they often save time and money. For example, automobile manufacturers use simulations to study the effects of crashes on humans.

*Use a survey.* A *survey* is an investigation of one or more characteristics of a population. Most often, surveys are carried out on people by asking them questions.

- A *census* is a count or measure of an entire population.
- A *sampling* is a count or measure of part of a population.
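
To make the idea of a simulation concrete, the sketch below uses a simple mathematical model (pseudo-random numbers standing in for fair dice) to estimate how often two dice sum to 7. The scenario and the function name are illustrative assumptions and do not come from the text.

```python
import random


def estimate_probability_of_seven(trials: int, seed: int = 0) -> float:
    """Estimate P(sum of two fair dice == 7) by simulating `trials` rolls."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == 7:
            hits += 1
    return hits / trials


estimate = estimate_probability_of_seven(10_000)
print(f"Estimated probability: {estimate:.3f} (exact value is 1/6)")
```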

When collecting data it is imperative to watch for biases. Below are the types of sampling methods that are used to collect unbiased data:

- A **random sample** is a sample in which every member of the population has an equal chance of being chosen.
- A **simple random sample** is a sample in which every possible sample of the same size has the same chance of being chosen.
- A sampling process in which a member of a population can be selected more than once is called sampling with replacement.
- A sampling process in which a member of a population cannot be selected more than once is called sampling without replacement.
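
Sampling with and without replacement can be sketched with Python's standard `random` module. The population of 20 member IDs is hypothetical. `sample` draws without replacement and `choices` draws with replacement.

```python
import random

rng = random.Random(42)          # fixed seed so the draws are reproducible
population = list(range(1, 21))  # hypothetical population of 20 member IDs

# Without replacement: no member can appear more than once.
# Every possible group of 5 members is equally likely to be drawn.
without_replacement = rng.sample(population, k=5)

# With replacement: the same member may be drawn more than once.
with_replacement = rng.choices(population, k=5)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```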

Sometimes statistics can be used either wittingly or unwittingly to mislead the readers. For instance, a researcher may deliberately choose a biased sample to achieve his or her objective(s). Or in another situation, a researcher may ask questions that encourage respondents to either intentionally or unintentionally answer the questions in a certain way.
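
The effect of a biased sample can be illustrated with a small sketch. The population of ages and the way the biased sample is chosen are hypothetical, picked only to make the distortion easy to see.

```python
import random
import statistics

# Hypothetical population: ages 18 through 90, one person per age.
population = list(range(18, 91))

# Biased sample: only the 20 youngest members are chosen.
biased_sample = population[:20]

# Simple random sample of the same size, for comparison.
rng = random.Random(1)  # fixed seed so the draw is reproducible
random_sample = rng.sample(population, k=20)

population_mean = statistics.mean(population)
biased_mean = statistics.mean(biased_sample)
random_mean = statistics.mean(random_sample)

print(f"Population mean age:    {population_mean}")
print(f"Biased sample mean age: {biased_mean}")
print(f"Random sample mean age: {random_mean}")
```

Because the biased sample contains only the youngest members, its mean falls well below the population mean, and the result would not depend on which young members happened to be picked.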

For attribution, please cite this work as

Reng (2019, Jan. 6). Reng Data Science Institute: Chapter One. Retrieved from https://www.rengdatascience.io/posts/2019-01-06-chapter-one/

BibTeX citation

@misc{reng2019chapter, author = {Reng, Alier Ëë}, title = {Reng Data Science Institute: Chapter One}, url = {https://www.rengdatascience.io/posts/2019-01-06-chapter-one/}, year = {2019} }