import_spss: Importing data from ‘SPSS’

import_spss() allows importing data from SPSS (.sav and .zsav files) into R by using the R package haven.
This vignette illustrates a typical workflow of importing an SPSS file using import_spss() and extractData(). For illustrative purposes we use a small example data set from the campus files of the German PISA Plus assessment. The complete campus files and the original data set can be accessed here and here.
We can import a .sav data set via the import_spss() function. Variable names are checked automatically for data base compatibility, and any changes to the variable names are reported to the console. This behavior can be suppressed by setting checkVarNames = FALSE.
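As a minimal sketch of the import step, the following writes a small toy data set to a temporary .sav file with haven and reads it back with import_spss(). The toy data, the object names (toy, sav_path, toy_gads) and the temporary file are assumptions made for illustration; they are not the PISA example file used in this vignette.

```r
library(haven)
library(eatGADS)

# Hypothetical toy data standing in for a real SPSS file
toy <- data.frame(idstud = 1:3)
toy$gender <- haven::labelled(c(1, 2, 1),
                              labels = c(Female = 1, Male = 2),
                              label = "Gender")

# Write the toy data to a temporary .sav file so the example is self-contained
sav_path <- tempfile(fileext = ".sav")
haven::write_sav(toy, sav_path)

# Import the file; checkVarNames = FALSE skips the variable name checks
toy_gads <- import_spss(sav_path, checkVarNames = FALSE)
class(toy_gads)
namesGADS(toy_gads)
```

With a real file, only the path passed to import_spss() would change.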
GADSdat objects
The resulting object is of the class GADSdat. It is basically a named list containing the actual data (dat) and the meta data (labels).
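The two list elements can be inspected directly. The sketch below builds a small GADSdat object from a plain data frame via import_DF(), the eatGADS importer for R data frames, so that it runs without an SPSS file. The toy data and the name toy_gads are hypothetical.

```r
library(eatGADS)

# Hypothetical toy data; import_DF() turns a data frame into a GADSdat object
toy_gads <- import_DF(data.frame(idstud = 1:3, score = c(10, 12, 9)))

names(toy_gads)      # names of the list elements
head(toy_gads$dat)   # the actual data
toy_gads$labels      # the meta data
```

For everyday work, the accessor functions namesGADS() and extractMeta() are usually more convenient than indexing the list by hand.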
The names of the variables in a GADSdat object can be accessed via the namesGADS() function. The meta data of variables can be accessed via the extractMeta() function.
namesGADS(gads_obj)
#> [1] "idstud" "idschool" "idclass" "schtype" "sameteach" "g8g9"
#> [7] "ganztag" "classsize" "repeated" "gender" "age" "language"
#> [13] "migration" "hisced" "hisei" "homepos" "books" "pared"
#> [19] "computer_age" "internet_age" "int_use_a" "int_use_b" "truancy_a" "truancy_b"
#> [25] "truancy_c" "int_a" "int_b" "int_c" "int_d" "instmot_a"
#> [31] "instmot_b" "instmot_c" "instmot_d" "norms_a" "norms_b" "norms_c"
#> [37] "norms_d" "norms_e" "norms_f" "anxiety_a" "anxiety_b" "anxiety_c"
#> [43] "anxiety_d" "anxiety_e" "selfcon_a" "selfcon_b" "selfcon_c" "selfcon_d"
#> [49] "selfcon_e" "worketh_a" "worketh_b" "worketh_c" "worketh_d" "worketh_e"
#> [55] "worketh_f" "worketh_g" "worketh_h" "worketh_i" "intent_a" "intent_b"
#> [61] "intent_c" "intent_d" "intent_e" "behav_a" "behav_b" "behav_c"
#> [67] "behav_d" "behav_e" "behav_f" "behav_g" "behav_h" "teach_a"
#> [73] "teach_b" "teach_c" "teach_d" "teach_e" "cognact_a" "cognact_b"
#> [79] "cognact_c" "cognact_d" "cognact_e" "cognact_f" "cognact_g" "cognact_h"
#> [85] "cognact_i" "discpline_a" "discpline_b" "discpline_c" "discpline_d" "discpline_e"
#> [91] "relation_a" "relation_b" "relation_c" "relation_d" "relation_e" "belong_a"
#> [97] "belong_b" "belong_c" "belong_d" "belong_e" "belong_f" "belong_g"
#> [103] "belong_h" "belong_i" "attitud_a" "attitud_b" "attitud_c" "attitud_d"
#> [109] "attitud_e" "attitud_f" "attitud_g" "attitud_h" "grade_de" "grade_ma"
#> [115] "grade_bio" "grade_che" "grade_phy" "grade_sci" "ma_pv1" "ma_pv2"
#> [121] "ma_pv3" "ma_pv4" "ma_pv5" "rea_pv1" "rea_pv2" "rea_pv3"
#> [127] "rea_pv4" "rea_pv5" "sci_pv1" "sci_pv2" "sci_pv3" "sci_pv4"
#> [133] "sci_pv5"
extractMeta(gads_obj, vars = c("schtype", "idschool"))
#> varName varLabel format display_width labeled value
#> 2 idschool School-ID F8.0 NA no NA
#> 4 schtype School track F8.0 NA yes 1
#> 5 schtype School track F8.0 NA yes 2
#> 6 schtype School track F8.0 NA yes 3
#> valLabel missings
#> 2 <NA> <NA>
#> 4 Gymnasium (academic track) valid
#> 5 Realschule valid
#> 6 schools with several courses of education valid
Commonly the most informative columns are varLabel (containing variable labels), value (referencing labeled values), valLabel (containing value labels) and missings (indicating whether a labeled value is a missing value ("miss") or not ("valid")).
Extracting data from GADSdat
If we want to use the data for analyses in R, we have to extract it from the GADSdat object via the function extractData(). In doing so, we have to make two important decisions: (a) how should values marked as missing values be treated (convertMiss)? And (b) how should labeled values in general be treated (convertLabels, dropPartialLabels, convertVariables)? See ?extractData for more details.
## convert labeled variables to characters
dat1 <- extractData(gads_obj, convertLabels = "character")
dat1[1:5, 1:10]
#> idstud idschool idclass schtype sameteach
#> 1 1 127 392 Realschule Yes
#> 2 2 65 201 schools with several courses of education No
#> 3 3 10 34 Gymnasium (academic track) No
#> 4 4 103 319 schools with several courses of education Yes
#> 5 5 57 179 Realschule Yes
#> g8g9 ganztag classsize repeated gender
#> 1 <NA> No 9 Did not repeat a grade Female
#> 2 <NA> No 10 Did not repeat a grade Female
#> 3 G8 - 8 years to abitur No 28 Did not repeat a grade Male
#> 4 <NA> No 12 Did not repeat a grade Male
#> 5 <NA> Yes 25 Did not repeat a grade Female
## leave labeled variables as numeric
dat2 <- extractData(gads_obj, convertLabels = "numeric")
dat2[1:5, 1:10]
#> idstud idschool idclass schtype sameteach g8g9 ganztag classsize repeated gender
#> 1 1 127 392 2 2 NA 1 9 1 1
#> 2 2 65 201 3 1 NA 1 10 1 1
#> 3 3 10 34 1 1 1 1 28 1 2
#> 4 4 103 319 3 2 NA 1 12 1 2
#> 5 5 57 179 2 2 NA 2 25 1 1
## convert only the variables listed in convertVariables to character;
## all other labeled variables stay numeric
dat3 <- extractData(gads_obj, convertLabels = "character",
                    convertVariables = c("gender", "language"))
dat3[1:5, 1:10]
#> idstud idschool idclass schtype sameteach g8g9 ganztag classsize repeated gender
#> 1 1 127 392 2 2 NA 1 9 1 Female
#> 2 2 65 201 3 1 NA 1 10 1 Female
#> 3 3 10 34 1 1 1 1 28 1 Male
#> 4 4 103 319 3 2 NA 1 12 1 Male
#> 5 5 57 179 2 2 NA 2 25 1 Female
In general, we recommend leaving labeled variables as numeric and converting values with missing codes to NA. The latter is the default behavior of the argument convertMiss. If required, value labels can always be accessed by using extractMeta() on the GADSdat object or the data base.
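The effect of convertMiss can be sketched with a toy variable that carries a user-defined missing code. The data, the missing code -99 and all object names below are assumptions made for illustration. The example relies on haven writing the na_values of a labelled_spss vector to the .sav file and on import_spss() picking them up as missing tags.

```r
library(haven)
library(eatGADS)

# Hypothetical toy data: -99 is declared as a user-defined missing value
toy <- data.frame(idstud = 1:3)
toy$score <- haven::labelled_spss(c(3, -99, 5),
                                  labels = c(omitted = -99),
                                  na_values = -99)

# Round trip through a temporary .sav file
sav_path <- tempfile(fileext = ".sav")
haven::write_sav(toy, sav_path)
toy_gads <- import_spss(sav_path)

# The missings column shows how the value -99 is tagged
extractMeta(toy_gads, vars = "score")

# Recommended: keep labeled variables numeric, convert missing codes to NA
dat_na <- extractData(toy_gads, convertMiss = TRUE, convertLabels = "numeric")

# For comparison: keep the raw missing codes in the data
dat_raw <- extractData(toy_gads, convertMiss = FALSE, convertLabels = "numeric")

dat_na$score
dat_raw$score
```

Comparing the two score vectors shows where the missing code was replaced. The value label remains available through extractMeta() in both cases.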