Title: | Sample Design, Drawing & Data Analysis Using Data Frames |
---|---|
Description: | Determine sample sizes, draw samples, and conduct data analysis using data frames. Specifically, it enables you to determine simple random sample sizes, stratified sample sizes, and complex stratified sample sizes using a secondary variable such as population; draw simple random samples and stratified random samples from sampling data frames; determine which observations are missing from a random sample, missing by strata, or duplicated within a dataset; and perform data analysis, including proportions, margins of error, and upper and lower bounds for simple, stratified, and cluster sample designs. |
Authors: | Michael Baldassaro |
Maintainer: | Michael Baldassaro <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.4 |
Built: | 2025-01-27 03:39:45 UTC |
Source: | https://github.com/mbaldassaro/sampler |
Data set containing 2017 Albania election results by polling station published by the Central Election Commission and opened by the Coalition of Domestic Observers & Democracy International.
albania
A data frame with 5362 rows and 45 variables
district, 12 in total
geocode for district
municipality, 61 in total
geocode for municipality
election area zone, 90 in total
village, 373 in total
geocode for village
polling station identifier
number of total registered voters
number of male registered voters
number of female registered voters
number of seats contested by district
name of polling center containing polling stations
type of polling center, 5 in total
number of total registered voters that cast ballots
number of female registered voters that cast ballots
number of male registered voters that cast ballots
number of ballots not used
number of ballots damaged
number of total ballots cast
number of ballots cast that were invalidated
number of valid ballots cast
number of ballots cast for LSI
number of ballots cast for PS
number of ballots cast for PKD
number of ballots cast for SFIDA
number of ballots cast for PR
number of ballots cast for PD
number of ballots cast for PBDKSH
number of ballots cast for ADK
number of ballots cast for PSD
number of ballots cast for AD
number of ballots cast for FRD
number of ballots cast for PDS
number of ballots cast for PDIU
number of ballots cast for AAK
number of ballots cast for MEGA
number of ballots cast for PKSH
number of ballots cast for APD
number of ballots cast for LIBRA
number of seats won by PS
number of seats won by PD
number of seats won by LSI
number of seats won by PDIU
number of seats won by PSD
https://albaniaelectiondata.herokuapp.com/
Calculate proportion and margin of error (unequal-sized cluster sample)
cpro(df, numerator, denominator, ci = 95, na = "", N = 0)
df |
object containing data frame on which to perform analysis |
numerator |
variable in data frame for which you want to calculate proportion and margin of error |
denominator |
variable in data frame containing population sizes of unequal clusters |
ci |
(optional) confidence level for establishing a confidence interval using z-score (defaults to 95; restricted to 80, 85, 90, 95 or 99 as input) |
na |
(optional) value that you want to filter and exclude (defaults to include everything) |
N |
(optional) population universe (e.g. 10000, nrow(df)); if N value is passed as an argument, margin of error will be calculated using fpc |
Returns table of responses (n), proportions, margins of error, lower and upper bounds by factor for a given variable in a cluster sample
[1] Survey Sampling, L. Kish, 1965, Equation 6.3.4
[2] Sampling Techniques, W.G. Cochran, 1977, Equation 3.34
alresults <- ssamp(albania, 890, qarku)
cpro(df=alresults, numerator=totalVoters, denominator=zgjedhes, ci=95)
cpro(df=alresults, numerator=pd, denominator=validVotes, ci=95, N=5361)
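The references cited for cpro() describe a ratio estimator for proportions drawn from clusters of unequal size. The following base R sketch illustrates that kind of calculation under the assumption of a Cochran-style variance formula. The helper name `cluster_prop` and the toy data are hypothetical, and this is not necessarily how the package computes its result.

```r
# Hypothetical helper: ratio-estimator proportion for unequal-sized clusters.
# a  = cluster totals for the attribute of interest (numerator)
# m  = cluster sizes (denominator)
# N  = number of clusters in the population (0 = skip the fpc)
cluster_prop <- function(a, m, ci = 95, N = 0) {
  n  <- length(a)
  p  <- sum(a) / sum(m)                  # ratio estimate of the proportion
  f  <- if (N > 0) n / N else 0          # sampling fraction used by the fpc
  s2 <- sum((a - p * m)^2) / (n - 1)     # spread of cluster residuals
  se <- sqrt((1 - f) * s2 / (n * mean(m)^2))
  z  <- qnorm(1 - (1 - ci / 100) / 2)    # two-sided z-score
  c(proportion = p, moe = z * se, lower = p - z * se, upper = p + z * se)
}

# Toy data: votes for one party and valid votes in five polling stations.
votes <- c(120, 80, 150, 60, 90)
valid <- c(300, 250, 400, 200, 220)
res <- cluster_prop(votes, valid, ci = 95, N = 50)
print(round(res, 4))
```

Passing a positive `N` shrinks the margin of error through the finite population correction, which mirrors the behaviour described for the N argument.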
Removes duplicate observations within collected data
dedupe(df, col_name)
df |
object containing data frame of collected data |
col_name |
variable within data frame by which to filter for duplicate values |
Returns table of all data based on unique values within collected data
aldupe <- rsamp(df=albania, n=390, rep=TRUE)
dedupe(df=aldupe, col_name=qvKod)
Identifies duplicate values within collected data
dupe(df, col_name)
df |
object containing data frame of collected data |
col_name |
variable within data frame by which to filter for duplicate values |
Returns table of duplicate values within collected data
aldupe <- rsamp(df=albania, n=390, rep=TRUE)
dupe(df=aldupe, col_name=qvKod)
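For readers who want to see the idea behind dupe() and dedupe() without the package, here is a minimal base R sketch using `duplicated()`. The data frame and column names are hypothetical. Whether dupe() reports every copy of a repeated key or only the later repeats is not stated here, so the sketch shows the later repeats only.

```r
# Toy collected data: station "B" was reported twice.
collected <- data.frame(
  station = c("A", "B", "B", "C"),
  turnout = c(210, 180, 180, 95),
  stringsAsFactors = FALSE
)

# Rows whose key already appeared earlier in the data (the repeats).
repeats <- collected[duplicated(collected$station), ]

# One row per key: keep the first occurrence of each station.
unique_rows <- collected[!duplicated(collected$station), ]

print(repeats)
print(unique_rows)
```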
Data set containing 2017 Albania election observation findings on the polling station opening process, collected by the Coalition of Domestic Observers (CDO). CDO conducted a statistically-based observation (SBO) exercise, deploying observers to a random sample of polling stations for the 25 June 2017 Albanian elections. This is a subset of the observation data collected by CDO observers and includes the data used to perform statistical analysis.
opening
A data frame with 524 rows and 19 variables
district, 12 in total
polling station identifier
number of registered voters at the polling station
number of ballot papers at the polling station
type of polling station, public or private
time when polling station opened, in 30 minute ranges
number of commissioners present at polling station
yes-no if polling station enabled voters to cast ballots in secrecy, po or jo
yes-no if polling station provided sufficient space to vote, po or jo
yes-no if campaign materials were removed from inside polling station, po or jo
yes-no if campaign materials were removed from outside polling station, po or jo
yes-no if commissioners completed the opening record checklist sheet, po or jo
yes-no if commissioners checked to ensure the ballot box was empty before opening, po or jo
yes-no if commissioners sealed the ballot box to prevent ballot tampering, po or jo
yes-no if commissioners recorded the seal number on the ballot box, po or jo
yes-no if all election materials were available at the polling station, po or jo
yes-no if the polling station was equipped for blind voters, po or jo
yes-no-partially if the polling station was equipped for disabled voters, po or jo or pjeserisht
overall assessment of the opening process (very good, good, problematic, very problematic): shummir, mir, meprob, shumprob
https://ona.io/cdo/35080/216662
Determines sample size by strata using sub-units
psampcalc(df, n, strata, unit, over = 0)
df |
object containing full sampling data frame (e.g. data) |
n |
sample size (integer) or object containing sample size |
strata |
variable in sampling data frame by which to stratify (e.g. region) |
unit |
variable in sampling data frame containing sub-units (e.g. population) |
over |
(optional) desired oversampling proportion (defaults to 0; takes value between 0 and 1 as input) |
Returns sample size per strata based on sub-units (rounded up to nearest integer)
[1] Sampling Design & Analysis, S. Lohr, 1999, 4.4
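As an illustration of allocating a sample by sub-units rather than by row counts, the base R sketch below distributes a sample across strata in proportion to each stratum's share of a secondary variable. The frame, the column names, and the choice to apply oversampling before rounding up are assumptions made for the example, not a description of the package internals.

```r
# Toy sampling frame: polling stations with registered voters in two regions.
frame <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  voters = c(500, 300, 400, 600, 200)
)
n    <- 10    # overall sample size
over <- 0.1   # oversample by 10% (assumed to apply before rounding)

# Share of sub-units (voters) held by each stratum.
units_by_region <- tapply(frame$voters, frame$region, sum)
share <- units_by_region / sum(units_by_region)

# Allocate in proportion to sub-units, rounding up to whole sample points.
alloc <- ceiling(n * (1 + over) * share)
print(alloc)
```

Because every stratum is rounded up, the allocated total (12 here) can exceed the requested sample size.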
Identifies missing points between sample and collected data
rmissing(sampdf, colldf, col_name)
sampdf |
object containing data frame of sample points |
colldf |
object containing data frame of collected data |
col_name |
common variable (i.e. key) in data frames by which to check for missing points |
Returns table of sample points missing from collected data
Simplified wrapper around dplyr::anti_join()
alsample <- rsamp(df=albania, 544)
alreceived <- rsamp(df=alsample, 390)
rmissing(sampdf=alsample, colldf=alreceived, col_name=qvKod)
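The anti-join idea behind rmissing() can be sketched in base R as follows: keep the sample rows whose key has no match in the collected data. The data frames and the `station` key are hypothetical.

```r
# Toy sample of four points and the two reports actually received.
sample_points <- data.frame(
  station = c("A", "B", "C", "D"),
  region  = c("North", "North", "South", "South"),
  stringsAsFactors = FALSE
)
received <- data.frame(
  station = c("A", "D"),
  turnout = c(210, 95),
  stringsAsFactors = FALSE
)

# Anti-join on the key: sample rows with no matching collected row.
missing_points <- sample_points[!(sample_points$station %in% received$station), ]
print(missing_points)
```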
Calculate proportion and margin of error (simple random sample)
rpro(df, col_name, ci = 95, na = "", N = 0)
df |
object containing data frame on which to perform analysis (e.g. data) |
col_name |
variable in data frame for which you want to calculate proportion and margin of error |
ci |
(optional) confidence level for establishing a confidence interval using z-score (defaults to 95; restricted to 80, 85, 90, 95 or 99 as input) |
na |
(optional) value that you want to filter and exclude (defaults to include everything) |
N |
(optional) population universe (e.g. 10000, nrow(df)); if N value is passed as an argument, margin of error will be calculated using fpc |
Returns table of responses (n), proportions, margins of error, lower and upper bounds by factor for a given variable
[1] Sampling Design & Analysis, S. Lohr, 1999, Equation 2.15
rpro(df=opening, col_name=openTime, ci=95, na="n/a", N=5361)
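To make the output columns concrete, here is a base R sketch of a proportion and margin of error for a simple random sample, assuming the usual standard error sqrt(fpc * p * (1 - p) / (n - 1)). The helper name `srs_prop` and the toy answers are hypothetical, and the package may differ in details such as how it derives the z-score.

```r
# Hypothetical helper: proportions and margins of error by response value.
srs_prop <- function(x, ci = 95, N = 0) {
  counts <- table(x)
  n   <- length(x)
  p   <- as.vector(counts) / n
  fpc <- if (N > 0) 1 - n / N else 1       # finite population correction
  se  <- sqrt(fpc * p * (1 - p) / (n - 1))
  z   <- qnorm(1 - (1 - ci / 100) / 2)     # two-sided z-score
  data.frame(
    response   = names(counts),
    n          = as.vector(counts),
    proportion = p,
    moe        = z * se,
    lower      = p - z * se,
    upper      = p + z * se
  )
}

# Toy data: 80 "po" (yes) and 20 "jo" (no) answers from 1000 stations.
answers <- c(rep("po", 80), rep("jo", 20))
res <- srs_prop(answers, ci = 95, N = 1000)
print(res)
```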
Draws simple random sample without replacement
rsamp(df, n, over = 0, rep = FALSE)
df |
object containing full sampling data frame (e.g. data) |
n |
sample size (integer) or object containing sample size |
over |
(optional) desired oversampling proportion (defaults to 0; takes value between 0 and 1 as input) |
rep |
(optional) whether to sample with replacement (defaults to FALSE) |
Returns simple random sample without replacement
Simplified wrapper around dplyr::sample_n()
rsamp(albania, n=360, over=0.1, rep=FALSE)
size <- rsampcalc(nrow(albania), 3, 95, 0.5)
randomsample <- rsamp(albania, size)
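A rough base R equivalent of drawing a simple random sample with oversampling is sketched below. It assumes that the oversampling proportion inflates the sample size as n * (1 + over), rounded up; the frame and variable names are hypothetical.

```r
set.seed(1)  # fixed seed so the draw is reproducible

# Toy sampling frame of 100 units.
frame <- data.frame(id = 1:100)
n    <- 20
over <- 0.25

# Inflate the sample size by the oversampling proportion, rounding up.
size <- ceiling(n * (1 + over))

# Draw row indices without replacement and subset the frame.
drawn <- frame[sample(nrow(frame), size, replace = FALSE), , drop = FALSE]
print(nrow(drawn))
```

Setting `replace = TRUE` in `sample()` would allow the same unit to be drawn more than once, which is how the duplicate examples for dupe() and dedupe() obtain repeated rows.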
Determines random sample size
rsampcalc(N, e, ci = 95, p = 0.5, over = 0)
N |
population universe (e.g. 10000, nrow(df)) |
e |
tolerable margin of error (integer or float, e.g. 5, 2.5) |
ci |
(optional) confidence level for establishing a confidence interval using z-score (defaults to 95; restricted to 80, 85, 90, 95 or 99 as input) |
p |
(optional) anticipated response distribution (defaults to 0.5; takes value between 0 and 1 as input) |
over |
(optional) desired oversampling proportion (defaults to 0; takes value between 0 and 1 as input) |
Returns appropriate sample size (rounded up to nearest integer)
[1] Sampling Design & Analysis, S. Lohr, 1999, equation 2.17
rsampcalc(N=5361, e=3, ci=95, p=0.5, over=0.1)
rsampcalc(nrow(data), 3)
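The arithmetic behind a sample size calculation of this kind can be sketched in base R. The version below assumes the common two-step approach: compute the size for an infinite population, then adjust for the finite population. The helper name `srs_size` is hypothetical, and the margin of error is taken as a percentage, as in the argument description.

```r
# Hypothetical helper: sample size for a simple random sample.
srs_size <- function(N, e, ci = 95, p = 0.5, over = 0) {
  z  <- qnorm(1 - (1 - ci / 100) / 2)      # two-sided z-score
  n0 <- z^2 * p * (1 - p) / (e / 100)^2    # size for an infinite population
  n  <- n0 / (1 + n0 / N)                  # finite population adjustment
  ceiling(n * (1 + over))                  # oversample, then round up
}

size <- srs_size(N = 5361, e = 3)
print(size)
```

For these inputs the sketch returns 890, the same sample size that the cpro() example passes to ssamp().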
Identifies number of missing points by strata between sample and collected data
smissing(sampdf, colldf, strata, col_name)
sampdf |
object containing data frame of sample points |
colldf |
object containing data frame of collected data |
strata |
variable in both data frames by which to stratify |
col_name |
common variable (i.e. key) in data frames by which to check for missing points |
Returns table of number of sample points by strata missing from collected data
Simplified wrapper around dplyr::anti_join()
alsample <- rsamp(df=albania, 544)
alreceived <- rsamp(df=alsample, 390)
smissing(sampdf=alsample, colldf=alreceived, strata=qarku, col_name=qvKod)
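Counting missing points by stratum amounts to an anti-join followed by a tally. A minimal base R sketch, with hypothetical data frames and column names:

```r
# Toy sample of five points and the two reports actually received.
sample_points <- data.frame(
  station = c("A", "B", "C", "D", "E"),
  region  = c("North", "North", "South", "South", "South"),
  stringsAsFactors = FALSE
)
received <- data.frame(
  station = c("A", "D"),
  region  = c("North", "South"),
  stringsAsFactors = FALSE
)

# Sample rows with no matching key in the collected data.
missing_points <- sample_points[!(sample_points$station %in% received$station), ]

# Number of missing points in each stratum.
missing_by_region <- table(missing_points$region)
print(missing_by_region)
```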
Calculate proportion and margin of error (stratified sample)
spro(fulldf, sampdf, strata, col_name, ci = 95, na = "")
fulldf |
object containing original data frame used to draw sample |
sampdf |
object containing data frame on which to perform analysis |
strata |
variable in both data frames by which to stratify |
col_name |
variable in data frame for which you want to calculate proportion and margin of error |
ci |
(optional) confidence level for establishing a confidence interval using z-score (defaults to 95; restricted to 80, 85, 90, 95 or 99 as input) |
na |
(optional) value that you want to filter and exclude (defaults to include everything) |
Returns table of responses (n), proportions, margins of error, lower and upper bounds by factor for a given variable in a stratified sample
[1] Sampling Design & Analysis, S. Lohr, 1999, 4.6 & 4.7
spro(fulldf=albania, sampdf=opening, strata=qarku, col_name=openTime, ci=95, na="n/a")
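The reason spro() needs both the full frame and the sample is that a stratified estimate weights each stratum by its share of the population. The base R sketch below shows one standard form of that calculation, with stratum weights N_h / N and a finite population correction per stratum. All object names and the toy data are hypothetical, and the package may organise the computation differently.

```r
# Stratum sizes in the full population (would come from the full frame).
N_h <- c(North = 400, South = 600)

# Toy sample: yes ("po") / no ("jo") answers by stratum.
samp <- data.frame(
  region = c(rep("North", 40), rep("South", 60)),
  answer = c(rep("po", 30), rep("jo", 10), rep("po", 30), rep("jo", 30)),
  stringsAsFactors = FALSE
)

n_h <- tapply(samp$answer, samp$region, length)        # sample size per stratum
p_h <- tapply(samp$answer == "po", samp$region, mean)  # proportion per stratum
N_s <- N_h[names(n_h)]                                 # align population sizes
W_h <- N_s / sum(N_s)                                  # stratum weights

# Weighted proportion and its variance across strata.
p_str   <- sum(W_h * p_h)
var_str <- sum(W_h^2 * (1 - n_h / N_s) * p_h * (1 - p_h) / (n_h - 1))

moe <- qnorm(0.975) * sqrt(var_str)                    # 95% margin of error
print(c(proportion = p_str, moe = moe,
        lower = p_str - moe, upper = p_str + moe))
```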
Draws stratified sample without replacement using proportional allocation
ssamp(df, n, strata, over = 0)
df |
object containing full sampling data frame (e.g. data) |
n |
sample size (integer) or object containing sample size |
strata |
variable in sampling data frame by which to stratify (e.g. region) |
over |
(optional) desired oversampling proportion (defaults to 0; takes value between 0 and 1 as input) |
Returns stratified sample without replacement
ssamp(df=albania, n=360, strata=qarku, over=0.1)
size <- rsampcalc(nrow(albania), 3, 95, 0.5)
stratifiedsample <- ssamp(albania, size, qarku)
Determines sample size by strata using proportional allocation
ssampcalc(df, n, strata, over = 0)
df |
object containing sampling data frame (e.g. data) |
n |
sample size (integer) or object containing sample size |
strata |
variable in sampling data frame by which to stratify (e.g. region) |
over |
(optional) desired oversampling proportion (defaults to 0; takes value between 0 and 1 as input) |
Returns proportional sample size per strata (rounded up to nearest integer)
[1] Sampling Design & Analysis, S. Lohr, 1999, 4.4
ssampcalc(df=albania, n=544, strata=qarku, over=0.05)
size <- rsampcalc(nrow(albania), 3, 95, 0.5)
ssampcalc(albania, size, qarku)
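Proportional allocation, and a stratified draw based on it, can be sketched in base R as follows. The frame and names are hypothetical, and the sketch only illustrates the technique described for ssampcalc() and ssamp(), not their actual code.

```r
set.seed(1)  # fixed seed so the draw is reproducible

# Toy sampling frame: 100 units in three strata.
frame <- data.frame(
  id     = 1:100,
  region = c(rep("North", 30), rep("South", 50), rep("East", 20)),
  stringsAsFactors = FALSE
)
n <- 11

# Allocate the sample in proportion to stratum size, rounding up.
N_h   <- table(frame$region)
alloc <- ceiling(n * as.vector(N_h) / nrow(frame))
names(alloc) <- names(N_h)
print(alloc)

# Draw the allocated number of rows from each stratum without replacement.
drawn <- do.call(rbind, lapply(names(alloc), function(h) {
  rows <- which(frame$region == h)
  frame[sample(rows, alloc[[h]]), , drop = FALSE]
}))
print(table(drawn$region))
```

Rounding each stratum up means the total drawn (13 here) can be slightly larger than the requested n.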