


ORIGINAL ARTICLE 

Year : 2016 | Volume : 144 | Issue : 3 | Page : 447-459

A single weighting approach to analyze respondent-driven sampling data
Vadivoo Selvaraj^{1}, Kangusamy Boopathi^{1}, Ramesh Paranjape^{2}, Sanjay Mehendale^{1}
^{1} National Institute of Epidemiology, Indian Council of Medical Research, TNHB, Ayapakkam, Chennai, India ^{2} National AIDS Research Institute, Bhosari, Pune, India
Date of Submission: 27-Jan-2015
Date of Web Publication: 20-Jan-2017
Correspondence Address: Sanjay Mehendale Indian Council of Medical Research, V. Ramalingaswami Bhawan, Ansari Nagar, New Delhi 110 029 India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/0971-5916.198665
Abstract   
Background and objectives: Respondent-driven sampling (RDS) is widely used to sample hidden populations, and RDS data are analyzed using a specially designed RDS analysis tool (RDSAT). RDSAT estimates parameters such as proportions. Analysis with RDSAT requires a separate weight for each variable, even within a single individual; hence, regression analysis is a problem. RDS analyst is another advanced software that offers three methods of estimation, namely, the successive sampling method, RDS I and RDS II. All of these are still being refined and need special skill to use. We propose a simple approach to analyze RDS data for comprehensive statistical analysis using any standard statistical software.
Methods: We proposed an approach (RDSMOD, respondent-driven sampling modified) that determines a single normalized weight (similar to RDS II of Volz-Heckathorn) for each participant. This approach converts the RDS data into clustered data to account for the pre-existing relationship between recruits and recruiters. Further, Taylor's linearization method was proposed for calculating confidence intervals for the estimates. The generalized estimating equation approach was used for regression analysis, and parameter estimates from different software were compared.
Results: The parameter estimates such as proportions obtained by our approach matched those from currently available special software for RDS data.
Interpretation & conclusions: The proposed weight was comparable to the different weights generated by RDSAT. The estimates were comparable to those obtained by the RDS II approach. RDSMOD provided an efficient and easy-to-use method of estimation and regression that accounts for inter-individual dependence among recruits.
Keywords: New approach - regression - respondent-driven sampling data analysis
How to cite this article: Selvaraj V, Boopathi K, Paranjape R, Mehendale S. A single weighting approach to analyze respondent-driven sampling data. Indian J Med Res 2016;144:447-59
How to cite this URL: Selvaraj V, Boopathi K, Paranjape R, Mehendale S. A single weighting approach to analyze respondent-driven sampling data. Indian J Med Res [serial online] 2016 [cited 2019 Oct 13];144:447-59. Available from: http://www.ijmr.org.in/text.asp?2016/144/3/447/198665
It is difficult to map and develop sampling frames for hard-to-reach and hidden populations such as injecting drug users (IDUs) or men having sex with men (MSM) due to privacy concerns and the close-knit nature of the groups. Hence, conventional probability-based sampling methods cannot be applied to sample them. Snowball sampling, key informant sampling and targeted sampling are some of the previously described sampling methods for these populations ^{[1],[2],[3]}. However, all these methods have their own limitations and known biases ^{[4]}. A slightly modified form of the snowball method has been used to count rare events such as maternal mortality ^{[5]}.
Respondent-driven sampling (RDS) was introduced by Heckathorn to sample hidden and hard-to-reach populations. RDS is a modified form of snowball sampling, with a system for assigning weights to compensate for unequal selection probabilities ^{[4]}. RDS starts by identifying prototype individuals, termed 'seeds', who are known to represent a specific hidden or hard-to-reach population. The seeds recruit the first wave of respondents, the first-wave respondents recruit the second wave, and such successive 'waves' continue recruiting respondents until the desired sample size is reached. Although respondents recruit those with whom they have a pre-existing relationship, the primary expectation is that respondents recruit randomly from their personal network ^{[6]}. The probability of inclusion is derived from an extension of Markov chain (MC) theory and a random walk on the network connecting the target population ^{[7],[8],[9],[10]}. This theoretical framework forms the basis for calculating unbiased estimates. As the selection of seeds is non-random, RDS data lack external validity ^{[11]}. However, once more than six waves are attained, the sample composition is expected to stabilize and become independent of the seeds ^{[4],[12]}.
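The random-walk argument can be illustrated with a minimal sketch. It assumes a small, connected, undirected toy network (hypothetical, not study data) and uses a "lazy" walk, which stays in place half the time, purely to guarantee convergence; the function name `stationary_distribution` is illustrative only. The point it shows is that, whatever the starting seed, the long-run probability of visiting a person is proportional to that person's degree.

```python
def stationary_distribution(adjacency, steps=200):
    """Iterate a lazy random walk on an undirected graph until it settles."""
    nodes = sorted(adjacency)
    # Start with all probability mass on a single "seed".
    prob = {v: 1.0 if v == nodes[0] else 0.0 for v in nodes}
    for _ in range(steps):
        # Lazy step: keep half the mass in place, spread half to neighbours.
        nxt = {v: 0.5 * prob[v] for v in nodes}
        for v in nodes:
            share = 0.5 * prob[v] / len(adjacency[v])
            for u in adjacency[v]:
                nxt[u] += share
        prob = nxt
    return prob


# Hypothetical 4-person network with undirected ties.
network = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
pi = stationary_distribution(network)

total_degree = sum(len(nbrs) for nbrs in network.values())
degree_share = {v: len(nbrs) / total_degree for v, nbrs in network.items()}

for v in sorted(network):
    print(v, round(pi[v], 4), round(degree_share[v], 4))
```

The two printed columns agree, which is the property that motivates weighting each respondent by the inverse of the reported degree.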
The existing software to analyze RDS data is the RDS analysis tool (RDSAT), and it can generate only estimators ^{[13]}. Currently, RDSAT is in the stage of refinement and evolution. RDSAT uses bootstrapping to obtain confidence intervals (CIs) for estimates ^{[8],[14]}. Goel and Salganik introduced an MC argument for population mixing ^{[10]}. They proposed an estimator that weights the variable using the size of the participants' network and the network pattern, focusing on relationships within the network. However, individualized weights have to be obtained for each variable and incorporated in the estimation procedure. Therefore, only one estimate can be made at a time, which makes data analysis time-consuming. In addition, regression analysis is not possible with RDSAT. Efforts were made to adapt RDS data to regression analysis for adjusting estimates to reflect the targeted population ^{[12],[15]}. Exporting the individualized weights of a chosen variable from RDSAT for univariate regression analysis was attempted. Multivariate weighted regression using the weights generated by RDSAT was also attempted ^{[16]}. However, RDSAT produces as many weights for each participant as there are variables, and this is the problem in applying multivariate regression to RDS data.
Volz and Heckathorn ^{[6]} generalized the Horvitz-Thompson estimator to adapt RDS estimation to survey sampling (RDS II), and this was found to outperform the MC method. This was a single-weight-per-participant approach, unlike RDSAT's multiple weights per participant. Their approach made it possible to do regression analysis of RDS data. Analytical calculation of the variance was made possible, but the problem of calculating CIs for a smaller group of respondents remained unresolved. Other approaches that have been proposed for analysis of RDS data are the RDSMR estimate (for analyzing continuous variables controlling for differential recruitment), RDSSS estimates (for eliminating the condition of selection with replacement) and variance estimation ^{[6],[17],[18],[19]}. All are currently in various stages of development ^{[20]}. RDS analyst (RDSA) is the most advanced software currently available, but small samples and the calculation of CIs for cross-classified data remain a problem ^{[21]}, and it adopts a bootstrap approach for calculating CIs. Thus, we need an approach or interface that allows use of RDS data in standard statistical applications and software.
The objective was to propose and validate a new two-step approach, termed RDSMOD, for analyzing RDS data. We hypothesized that determining a single normalized weight for each participant (irrespective of the number of variables) and transforming RDS data into clustered data without affecting the recruitment pattern (sequence and equilibrium) would enable analytical calculation of CIs for the estimates, including regression coefficients.
Material & Methods   
The RDSMOD approach was applied to a real dataset from India as well as four datasets available in the public domain to estimate various parameters and their CIs. In addition, three more real datasets were obtained and their estimates were presented. The STATA SVY module was used for analysis. The Taylor linearization method was applied to calculate standard errors (SEs) of estimates ^{[22]}. The precision of the proposed estimates was assessed based on the length of the CIs.
Data sources
Indian dataset: We used the first round data of Integrated Behavioural and Biological Assessment (IBBA) conducted in Churachandpur District of Manipur State, India, during the first quarter of 2006 on 419 IDUs recruited using RDS ^{[23],[24]}. Three more datasets of the same survey were also obtained and HIV prevalence estimates were presented.
The datasets available in the public domain were: (i) 1-3: Three simulated datasets of the 'RDSA' module, namely faux (RDS), fauxsycamore (RDS) and fauxmadrona (RDS) ^{[21]}. (ii) 4: Jazz musicians dataset of RDSAT 7.1.46 ^{[25]}.
Statistical analysis: The Bland–Altman method was used to compare the single weight generated by RDSMOD with the individualized weights generated by RDSAT for different variables of the Churachandpur data ^{[26]}. The parameters and their 95 per cent CIs were estimated. The CIs were obtained as output of the SVY module of STATA, using the linearization and replication methods for calculating SEs ^{[27]}. For multiple regression analysis, the weighted generalized estimating equation (WGEE) was used.
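The linearization step can be sketched in a few lines. This is a minimal illustration, not the STATA SVY code the authors ran: it assumes a single stratum, clusters treated as primary sampling units drawn with replacement, and the textbook linearized variance for a weighted proportion. The function name `linearized_se_proportion` and the example values are hypothetical.

```python
import math
from collections import defaultdict


def linearized_se_proportion(y, w, cluster):
    """Weighted proportion and its Taylor-linearized SE with clusters as PSUs."""
    total_w = sum(w)
    p = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized residual of the ratio estimator, summed within each cluster.
    cluster_scores = defaultdict(float)
    for yi, wi, ci in zip(y, w, cluster):
        cluster_scores[ci] += wi * (yi - p) / total_w

    m = len(cluster_scores)  # number of clusters; needs m >= 2
    mean_score = sum(cluster_scores.values()) / m
    variance = m / (m - 1) * sum(
        (s - mean_score) ** 2 for s in cluster_scores.values()
    )
    return p, math.sqrt(variance)


# Hypothetical data: 6 respondents in 3 clusters, equal weights.
y = [1, 0, 1, 0, 1, 1]
w = [1.0] * 6
cluster = [1, 1, 2, 2, 3, 3]
p_hat, se = linearized_se_proportion(y, w, cluster)
print(round(p_hat, 4), round(se, 4))
```

Because the variance is built from between-cluster variation only, the SE depends on the number of clusters rather than the number of respondents, which is one reason few clusters can give wide intervals.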
For analysis, RDSAT 6.0.1 (Cornell University, Ithaca, NY), RDSA, STATA 10 (Stata Corp, Texas, USA) and SAS (Enterprise edition 4.3, SAS Institute, Cary, North Carolina, USA) were used. Network diagrams were drawn with NetDraw 2.090 (Borgatti S.P, NetDraw Software for Network Visualization, Lexington, NY).
Data analysis by new approach (RDSMOD)
Derivation of single weight and estimation method: Under the assumption that, in this chain referral sampling, a recruiter selects a subject from his network independently and with probability proportional to his degree d_{i} (the number of men he knows and who know him) ^{[8]}, a unique sampling weight W_{i} was derived for the i^{th} participant. With these new weights and the survey sampling module (SVY) of STATA, population parameter estimates were calculated ^{[27]}.
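A minimal sketch of such a single weight follows. It assumes the form suggested by the Results section, namely the harmonic mean of the reported network sizes divided by the participant's own network size; the function names and the example degrees are hypothetical and do not come from the study data.

```python
def single_weights(degrees):
    """One weight per participant: harmonic mean of degrees / own degree."""
    n = len(degrees)
    harmonic_mean = n / sum(1.0 / d for d in degrees)
    return [harmonic_mean / d for d in degrees]


def weighted_proportion(indicator, weights):
    """Weighted share of participants whose indicator equals 1."""
    return sum(w * y for w, y in zip(weights, indicator)) / sum(weights)


# Hypothetical reported network sizes and a 0/1 outcome for 4 participants.
degrees = [2, 4, 4, 8]
outcome = [1, 0, 1, 0]

weights = single_weights(degrees)
p_hat = weighted_proportion(outcome, weights)
print([round(w, 4) for w in weights], round(p_hat, 4))
```

With this normalization the weights sum to the sample size, and the same weight vector can be reused for every variable, which is what allows it to be passed to a standard survey or regression routine.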
Formation of clusters
Formation of clusters in Churachandpur dataset (real dataset): The recruitment pattern is depicted as a recruitment network diagram drawn with 'NetDraw' in [Figure 1]A. RDS data were converted into clusters by discarding all the seeds from the network chains [Figure 1]B. All participants in a branch of a network chain that remained after discarding a seed were considered members of one cluster. An assumption was made that the clusters thus formed were independent, though some traits (characteristic affiliation) of the respective seeds would prevail upon the members of the clusters thus formed. However, this correlation would diminish or vanish with expanding waves and a widening gap between recruiters and recruits, thus diluting the trait of the seed. Further, the recruits within the clusters would have intracluster correlations that need to be addressed in any type of analysis. Had the seeds been more numerous and independent, all recruits under a seed could have been considered an independent cluster. In addition, with a larger number of clusters, the estimates could be better.

Figure 1: (A) RDS network recruitment diagram of injecting drug users (IDUs) in Churachandpur district of Manipur State of India, 2006. (B) Network recruitment diagram of clusters created from the IDUs in Churachandpur district of Manipur State of India, 2006. All red circles: HIV +ve; all green triangles: HIV –ve. C1 to C17 are clusters formed from networks. Seeds are shown in larger size. Arrows represent the direction of the recruitment chain. (NetDraw 2.090 software; Data source: IBBA Round 1).
Formation of clusters in other example datasets: To perform RDSMOD on the other datasets (datasets in the public domain), all the recruits under a seed were considered a cluster. Hence, the clusters would be independent if the seeds were independent. Thus, if there were ten seeds, ten clusters would be formed. For example, in the fauxsycamore dataset of RDSA, there were ten seeds. Therefore, ten clusters were formed for this analysis. For the dataset 'faux (RDS)', there was only one seed, and therefore, all the recruits were assumed to be from a single cluster. This was done to study the performance of RDSMOD in situations where all the recruits under only one seed constituted a cluster.
Formation of clusters in yet other datasets (real datasets, viz. Bishnupur, Phek and Wokha): In the Bishnupur dataset, all seeds (nine) were removed. In the process, one cluster with a single respondent was not considered. Thus, there were ten exclusions (nine seeds and one first-wave respondent).
In Phek and Wokha datasets, all recruits under a seed were considered as a cluster (nine clusters each).
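The two ways of forming clusters described above can be sketched as follows. This is an illustration under stated assumptions: recruitment is stored as a hypothetical mapping from each participant to the recruiter (None for a seed), and when seeds are kept, the seed itself is placed in its own cluster, which the text does not specify. The function name `assign_clusters` is illustrative only.

```python
def assign_clusters(recruiter_of, drop_seeds=True):
    """Map each retained participant to a cluster label.

    recruiter_of: dict participant -> recruiter (None for a seed).
    drop_seeds=True : seeds are discarded and every first-wave recruit
                      heads its own cluster (the branch below it).
    drop_seeds=False: each seed together with all its recruits is one cluster.
    """
    labels = {}
    for person, recruiter in recruiter_of.items():
        if recruiter is None:  # this participant is a seed
            if not drop_seeds:
                labels[person] = person
            continue
        node = person
        # Climb the recruitment chain until reaching the cluster head:
        # the first-wave recruit (seeds dropped) or the seed (seeds kept).
        while recruiter_of[node] is not None and not (
            drop_seeds and recruiter_of[recruiter_of[node]] is None
        ):
            node = recruiter_of[node]
        labels[person] = node
    return labels


# Hypothetical recruitment chains: two seeds, s1 and s2.
recruiter_of = {
    "s1": None, "a": "s1", "b": "s1", "c": "a", "d": "c",
    "s2": None, "e": "s2",
}

by_branch = assign_clusters(recruiter_of, drop_seeds=True)
by_seed = assign_clusters(recruiter_of, drop_seeds=False)
print(by_branch)
print(by_seed)
```

In this toy example, dropping the seeds yields three clusters, two of them of size one; the text describes excluding such single-respondent clusters before analysis, a step left out of the sketch.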
Data analysis by RDSAT: The analysis tool RDSAT (6.0.1) was set to use average network size by the adjusted mean values method. The number of resamplings to determine the bootstrap 95 per cent CI was set to 2500. The enhanced smoothing algorithm type was employed. Homophily (Hx) for each variable was obtained to understand the magnitude of the characteristic affiliation of recruiters with their recruits. The number of waves required for attaining equilibrium was also estimated by choosing a convergence radius of 0.02. We assumed a median base population of 10,000 and used 500 bootstrap replications to analyze the additional three real datasets (viz. Bishnupur, Phek and Wokha).
Comparison of weights by RDSMOD and the individualized weights of RDSAT for different variables under analysis: The single weight per study participant derived for RDSMOD and the individualized weights generated by RDSAT for different variables were compared by the Bland–Altman method ^{[26]}. For each variable, the difference between the RDSAT weight and the calculated RDSMOD weight was plotted against the mean of the weights from these two approaches [Figure 2]. If the points on the Bland–Altman plot were uniformly scattered between the limits of agreement, it would suggest good agreement between the weights from the two approaches. This analysis was performed to compare the individualized weights generated for each variable by RDSAT with the single weight for each individual by RDSMOD.

Figure 2: Agreement between the weights generated by RDSAT and RDSMOD by the Bland-Altman method. Source: Paul Seed, 2014. "BAPLOT: Stata module to produce Bland-Altman plots," Statistical Software Components S457853, Boston College Department of Economics.
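The quantities behind a Bland–Altman plot are simple to compute. The sketch below uses only the Python standard library; the function name `bland_altman_limits` and the two weight vectors are hypothetical and stand in for the RDSAT and RDSMOD weights of one variable.

```python
import statistics


def bland_altman_limits(a, b):
    """Return pairwise means, differences, bias and 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = statistics.mean(diffs)   # mean of the differences
    sd = statistics.stdev(diffs)    # sample SD of the differences
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Hypothetical weights for four participants from two methods.
weights_method_1 = [1.0, 0.8, 1.2, 0.5]
weights_method_2 = [0.9, 0.9, 1.1, 0.6]

means, diffs, bias, (lower, upper) = bland_altman_limits(
    weights_method_1, weights_method_2
)
inside = [lower <= d <= upper for d in diffs]
print(round(bias, 4), round(lower, 4), round(upper, 4), all(inside))
```

Plotting `diffs` against `means` with horizontal lines at `bias`, `lower` and `upper` gives the usual display; a visible trend in the points would indicate that the disagreement changes with the size of the weight.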
Comparison of estimates of RDSMOD and RDSA on example datasets: As our weights were similar to those of RDS II of Volz and Heckathorn ^{[6]}, RDSMOD and RDSA (RDS II) were expected to yield similar parameter estimates but not similar CIs. The comparison of estimates by different approaches using the Churachandpur data is presented in [Table 1] and [Table 2]. The results of the comparison using the datasets faux, fauxmadrona and fauxsycamore of RDSA and the Jazz musicians dataset of RDSAT 7.1.46 are presented in [Table 3] and [Table 4]. In addition, HIV prevalence estimates were calculated by both RDSA and RDSMOD for three further real datasets (data not shown).

Table 1: Estimates of proportions of population parameters by RDSMOD (respondent-driven sampling, modified), RDSA (RDS analyst) and RDSAT (respondent-driven sampling analysis tool), Churachandpur data
Table 2: Estimates of proportions of HIV status cross-classified by factors using RDSMOD (respondent-driven sampling, modified), RDSA (RDS analyst) and RDSAT (respondent-driven sampling analysis tool), Churachandpur data
Table 3: Estimates of proportions [estimates (95% CI)] of various subgroups of variables using RDSMOD (respondent-driven sampling, modified), RDSA (RDS analyst) and RDSAT (respondent-driven sampling analysis tool) on different datasets, viz. faux, fauxsycamore, fauxmadrona
Table 4: Estimates of proportions [estimates (95% CI)] using RDSMOD (respondent-driven sampling, modified), RDSA (RDS analyst) and RDSAT (respondent-driven sampling analysis tool) on the Jazz musicians dataset
Regression analyses: Similarity induced by clusters would violate the standard assumption of independent observations from each individual. In the regression setup, the generalized estimating equation approach accounts for intracluster correlation ^{[28]}. This approach was used to study the affiliation factors for HIV positivity (factors associated with HIV) among the IDUs recruited in Churachandpur, India. Regression analysis was performed using the WGEE approach with an autoregressive-1 (AR1) correlation structure, as a logical ordering of recruitment was inevitably present in the RDS selection process. Furthermore, the AR1 structure is appropriate when the correlation between sample units is expected to decrease with increasing distance within the recruitment chain. The CIs for the parameter estimates of the regression equation were obtained with SAS software ^{[29]}. The results are presented in [Table 5]. To understand the nature of linkages in the recruitment of IDUs by HIV status, WGEE with an exchangeable correlation structure was also performed. If Hx is high for a variable, the parameter estimates under these two assumed correlation structures (AR1 and exchangeable) would differ in the regression.

Table 5: Factors associated with HIV for different correlation structures of the weighted generalized estimating equation
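For reference, the two working correlation structures can be written out explicitly. The notation is an assumption made for illustration: y_{cj} denotes the outcome of the j-th member of cluster c, members are taken to be indexed in recruitment order (the text does not state how they were ordered), and α is the working correlation parameter.

```latex
\[
\text{Exchangeable:}\quad \operatorname{Corr}(y_{cj}, y_{ck}) = \alpha \quad (j \neq k),
\qquad
\text{AR1:}\quad \operatorname{Corr}(y_{cj}, y_{ck}) = \alpha^{|j-k|}
\]
\[
R_{\text{exch}} =
\begin{pmatrix}
1 & \alpha & \alpha \\
\alpha & 1 & \alpha \\
\alpha & \alpha & 1
\end{pmatrix},
\qquad
R_{\text{AR1}} =
\begin{pmatrix}
1 & \alpha & \alpha^{2} \\
\alpha & 1 & \alpha \\
\alpha^{2} & \alpha & 1
\end{pmatrix}
\]
```

For a cluster of three members, the matrices differ only in the entry linking the first and third members, which shows why the two structures give close results when the within-cluster correlation is small.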
Results   
The proposed weights were similar to those of RDS II of Volz and Heckathorn ^{[13]}, except for a constant term (the harmonic mean of the network size) in the numerator as a normalizing factor.
Comparison of estimates of parameters of different datasets by RDSMOD with RDSAT and RDSA
Churachandpur, India RDS data: All the seeds grew and reached up to seven waves. The recruitment per seed ranged from 50 to 111. A random mixing pattern of recruitment was observed among HIV positives and negatives in the network recruitment diagram [Figure 1]A and also in the demographically adjusted recruitment matrices of RDSAT (symmetry, not shown). The estimated number of waves required to reach equilibrium was 2-3 for all variables, except for one subset variable. All the variables considered for this analysis attained equilibrium with more than six waves. RDSAT and RDSA used all 419 recruits for analysis.
For our new approach (RDSMOD), 17 clusters were formed after discarding six seeds [Figure 1]B. For example, when seed 2 was removed, three clusters were formed with assigned cluster numbers C4, C5 and C6 of size 41, 28 and 41, respectively. One cluster (C2) of size one from seed 1 was not considered for cluster analysis. Thus, 16 clusters comprising 412 recruits were available for analysis by our new approach, with a loss of seven participants (six seeds and one first-wave recruit).
Bland–Altman plots indicated that the differences between the single weights of RDSMOD and the weights of RDSAT for different variables were within the acceptance limits (i.e. within the mean of the differences between the two methods ±1.96 standard deviations of these differences) for all four variables considered [Figure 2]. Further, the mean of the differences was nearly zero, signalling that the single weight per participant was similar to the multiple weights of RDSAT. However, the trends within the plots indicated that the differences varied with magnitude. Thus, the weights calculated by the two methods depended strongly on each other. For this dataset, this indicated reasonable agreement between the weight calculated by our approach for an individual and RDSAT's several weights for different variables of the same individual.
The estimates of proportions by RDSMOD across variables were similar to those of RDSA and RDSAT [Table 1]. However, the CIs by RDSMOD were wider than those of the other two methods. The Hx was highest for the character 'sharing of injection needle' (Hx = 0.381). The other affiliation characters for HIV were 'daily injection of drugs' (Hx = 0.156) and 'duration of injecting drugs more than five years' (Hx = 0.143). Negative affiliation in the recruitment was noticed among those with a duration of injecting drugs of less than two years (Hx = −0.149). For a subsample, RDSAT produced estimates that were not in line with the observed frequencies and could not produce a CI. For example, a random sample of 37 specimens was tested for herpes simplex virus type 2 (HSV 2). Among them, eight were 'positive', two were 'inconclusive' and the remaining were 'negative'. RDSAT estimated the proportion of positives (8 out of 37) as 0 per cent and of inconclusive results (2 out of 37) as 37.1 per cent. RDSAT showed that the estimated mean number of waves to attain equilibrium for this variable was as high as 1960 [Table 1].
RDSMOD yielded similar and comparable parameter estimates for cross-classified variables as well [Table 2]. However, slightly wider CIs were noticed for many of the estimates by RDSMOD in both [Table 1] and [Table 2]. RDSA did not produce CIs for cross-classified data [Table 2].
Using the RDSMOD method, all parameters were re-estimated with the individualized weights per variable generated by RDSAT (data not shown). The estimates were almost the same as those of the single weighting procedure of RDSMOD, implying that a single weight per individual was sufficient rather than multiple weights per individual. A similar exercise was performed on cross-classified data. The estimates by the single weight approach and those with the individualized weights of RDSAT were identical (data not shown). This also indicated that the single weighting approach worked well for cross-classified estimations.
Comparison of results of RDSMOD, RDSAT and RDSA with other example datasets: The estimates of parameters and their CIs for the other example RDS datasets are presented in [Table 3] and [Table 4]. This exercise was done not to draw any inference but to compare the estimates.
Dataset 1 – faux: For variable 'A' of the faux dataset, the estimates of population parameters by RDSA (RDS II), RDSMOD and RDSAT (RDS I) were almost the same, including their CIs. However, for variable 'B' of the same dataset, the population estimates by RDSA and RDSMOD were identical, while RDSAT (RDS I) yielded different results for all three parameters.
Dataset 2 – fauxsycamore: For all three variables (C, D, E), the estimates by RDSMOD and RDSA were comparable. The CIs produced by RDSMOD were almost equal to or narrower than those of RDSA. RDSAT (RDS I) produced different results for all these variables.
Dataset 3 – fauxmadrona: This dataset contains three variables (F, G, H). The variable 'G' has 14 groups and 'H' has 13 groups. The estimates by RDSMOD and RDSA were identical, and the estimates by all three methods were comparable. For 'G', the CIs were wider for RDSA, whereas RDSMOD gave narrow CIs. RDSMOD and RDSA produced negative limits for some CIs; RDSAT did not produce any of that type.
Dataset 4 – Jazz musician: The estimates of parameters by RDSMOD and RDSA were identical. RDSMOD yielded narrow CIs. RDSAT also produced comparable results for all four variables.
Dataset 5 – Bishnupur, Phek and Wokha: The HIV prevalence estimate and CI by RDSMOD were similar to those of RDSA for the Wokha data, and the CI was slightly wider for the Bishnupur data. However, the CI for the Phek data by RDSMOD was wider (data not shown).
Results of regression analysis: WGEE with both AR1 and exchangeable correlation structures showed that older age (≥25 yr) and a longer period of injecting drug use (≥6 yr) were associated with HIV positivity [Table 5]. Daily sharing of needles was not associated with HIV positivity. The similarity of the regression coefficients under the AR1 and exchangeable correlation structures indicated random mixing of recruiters and recruits with respect to HIV status in the Churachandpur data [Table 5].
Discussion   
As there is no method available to obtain the true value of a parameter from RDS data, all our comparisons were mainly with RDS II estimates (this method calculates SEs of estimates analytically), and hence, comparison of the RDSMOD estimates with other methods such as the successive sampling method was not possible. The proposed new approach for analysis of RDS data was simple and less time-consuming. Additionally, this approach was able to generate population estimates comparable to those derived by RDSA (RDS II), the most advanced software currently available to analyze RDS data. The precision of estimates by our approach appeared to be superior to that of RDSA in the example datasets [viz. faux (RDS), fauxsycamore (RDS), fauxmadrona (RDS) and the Jazz musicians dataset].
In the new approach, clusters were formed without affecting the sequential and natural ordering of selection. In this process, though information on all seeds (non-random) was lost, RDS data were robust and not likely to be affected by the inclusion or exclusion of out-of-equilibrium data, i.e. data collected before reaching equilibrium ^{[30]}. Discarding earlier waves has also been recommended in previous reports ^{[10],[15]}. Thus, our results might not be affected by discarding the six seeds from the analysis. Formation of clusters paved the way to account for related characters in the recruitment process and opened the data to other statistical methods and to analysis by routine statistical software.
It has been suggested that the tendency towards Hx varies among groups ^{[31]}. Hence, it is important to measure the tendency towards Hx with respect to different respondent characteristics and to use this information to weight the sample to compensate for any biases ^{[31]}. It appears that, in general, the network's composition with respect to personal attributes may exhibit Hx with respect to only a particular trait or a few characters. In our sample, 'sharing of needle' was the most prominent trait (knowing each other) irrespective of HIV status. Therefore, technically, a single unique weight would be sufficient to compensate for the Hx of different respondent characteristics. In addition, as sampling weights were used only to compensate for the unequal probability of inclusion in the sample, a common weight for each individual was sufficient rather than individualized weights for each variable, and the results were comparable to those of RDSAT. The similarity of the parameter estimates by our approach using RDSAT weights in one-way and two-way tables indicated that a single weight per individual was sufficient.
Our RDSMOD approach was similar to RDS II of Volz and Heckathorn ^{[6]}. The only difference was that the weights by RDSMOD had an additional constant multiplier compared to those of Volz and Heckathorn. Although this constant does not affect the estimates, it is needed for the direct comparison of the weights with those of RDSAT. The very basic assumption made was that a recruiter recruited the subjects independently with probability proportional to the network size of the recruiter. This assumption was based on the work of Salganik and Heckathorn ^{[8]}, who showed that a random walk on a network is a Markov process which, at equilibrium, occupies a node with probability proportional to its degree. The applicability of the Hansen-Hurwitz estimator under these assumptions provides the theoretical and conceptual foundation for our approach of deriving unique weights ^{[32]}.
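That the constant multiplier leaves the estimates unchanged can be checked in one line. The notation here is assumed for illustration: d_i is the reported network size of participant i, n the sample size, A the set of participants with the trait of interest, and H the harmonic mean of the network sizes, taken to be the constant described in the text.

```latex
\[
W_i = \frac{H}{d_i}, \qquad
H = \frac{n}{\sum_{j=1}^{n} d_j^{-1}}, \qquad
\sum_{i=1}^{n} W_i = n
\]
\[
\hat{p}_A
= \frac{\sum_{i \in A} W_i}{\sum_{i=1}^{n} W_i}
= \frac{H \sum_{i \in A} d_i^{-1}}{H \sum_{i=1}^{n} d_i^{-1}}
= \frac{\sum_{i \in A} d_i^{-1}}{\sum_{i=1}^{n} d_i^{-1}}
\]
```

The last expression is the inverse-degree weighted proportion of RDS II, so H cancels from any ratio-type estimate; its only effect is to rescale the weights so that they sum to n and average one.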
The proposed approach necessitated incorporating the clustering effect in the regression model, as clustering results in a lack of independence among the errors in regression. The generalized estimating equation (GEE) approach resolves this problem by appropriately accounting for the correlation structure of a variable of interest between recruits and recruiter ^{[28]}. The process of fitting a model should incorporate sample weights as well as information about the correlation between sample units. Weighted estimating equations are the most popular methods for obtaining consistent estimates of regression coefficients with sample survey data ^{[33],[34]}. Therefore, we used the WGEE approach to study the affiliation factors for HIV positivity. The AR1 correlation structure accounted for the logical ordering of recruitment, and the exchangeable structure assumed equal correlation between any two recruited individuals within a cluster. The similarity of results under the two correlation structures (AR1 and exchangeable) suggested that HIV status was not an indicative factor for recruitment preferences in recruiting HIV-positive/negative IDUs in the Churachandpur dataset. The slight variations found in model coefficients between these two procedures (AR1 and exchangeable) could be due to the possible omission from the model of a variable that had a strong interaction with the independent variables and was highly correlated with the weights ^{[35]}.
The advantage of our approach is that it allows estimations using standard software such as STATA or any other software that accommodates survey sampling method. Also, the problem of subgroup analysis of RDS data could be overcome.
It was assumed that the clusters formed were independent after discarding a seed, although some traits of that seed might prevail upon the clusters of that seed. However, this limitation can be overcome by selecting more seeds (preferably independent) at the stage of data collection and considering each seed with its recruits as a separate cluster, as was done with the other example datasets. The new approach resulted in slightly wider CIs in the Churachandpur data compared to RDSA and RDSAT. One possible reason is that RDSMOD employed the analytical method to calculate these. It could also happen if the intracluster correlation was high. Volz and Heckathorn ^{[6]} have reported that wider CIs are expected when the variances are calculated analytically. Empirical tests of RDS have indicated that the analytical method overestimates the CIs ^{[30]}. As we have also accounted for the intracluster correlation, still wider CIs are expected ^{[36]}. Clustering and weighting normally result in decreased precision ^{[37]}. Volz and Heckathorn ^{[6]} suggested that the loss of precision might happen due to the single weights used for all variables, as was done in the present study. In contrast, RDSMOD yielded estimates with the same or higher precision in the other example datasets used for comparisons, viz. faux (RDS), fauxsycamore (RDS), fauxmadrona (RDS) and the Jazz musicians dataset. This indicated the possibility that the intracluster correlation in these example datasets (simulated data) was not high. As inferences based on RDS data require many strong assumptions, Gile et al ^{[38]} have suggested some diagnostic tools to help researchers understand their RDS data better and to encourage future statistical research on RDS sampling and inference.
Our study had certain limitations. The weight calculation was solely dependent on the network size (degree reported by the respondent). Thus, inaccuracies in the self-reported degree might have introduced biases in the estimates. The Hansen-Hurwitz method is applicable only when the sample elements are selected independently with replacement, and that may not be true in the real sense of RDS. We assumed that the clusters formed were independent, though some traits of the corresponding seed would have prevailed. A few of the lower limits of our CIs for proportions in an example dataset were negative.
In conclusion, the proposed alternative approach of using single weight and converting RDS data into clusters before analysis can be recommended as it generates analytical CIs and allows for estimates for smaller groups as well. RDS data can thus be analyzed faster using commonly used statistical software that also permits wider range of statistical analysis including analysis of continuous variables.
Conflicts of Interest: None.
References   
1.  Goodman L. Snowball sampling. Ann Math Stat 1961; 32 : 148-70.
2.  Deaux E, Callaghan JW. Key informant versus self-report estimates of health behavior. Eval Rev 1985; 9 : 365-8.
3.  Watters JK, Biernacki P. Targeted sampling: options for the study of hidden populations. Soc Probl 1989; 36 : 416-30.
4.  Heckathorn DD. Respondent-driven sampling: a new approach to the study of hidden populations. Soc Probl 1997; 44 : 174-99.
5.  Singh P, Pandey A, Aggarwal A. House-to-house survey vs. snowball technique for capturing maternal deaths in India: a search for a cost-effective method. Indian J Med Res 2007; 125 : 550-6.
6.  Volz E, Heckathorn DD. Probability based estimation theory for respondent-driven sampling. J Off Stat 2008; 24 : 79-97.
7.  Fararo TJ, Skvoretz J. Biased networks and social structure theorems. Soc Networks 1984; 6 : 223-58.
8.  Salganik MJ, Heckathorn DD. Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 2004; 34 : 193-239.
9.  Gile KJ, Handcock MS. Respondent-driven sampling: an assessment of current methodology. Sociol Methodol 2010; 40 : 285-327.
10.  Goel S, Salganik MJ. Respondent-driven sampling as Markov chain Monte Carlo. Stat Med 2009; 28 : 2202-29.
11.  Griffiths P, Gossop M, Powis B, Strang J. Reaching hidden populations of drug users by privileged access interviewers: methodological and practical issues. Addiction 1993; 88 : 1617-26.
12.  Johnston LG, Malekinejad M, Kendall C, Iuppa IM, Rutherford GW. Implementation challenges to using respondent-driven sampling methodology for HIV biological and behavioral surveillance: field experiences in international settings. AIDS Behav 2008; 12 : S131-41.
13.  Volz E, Wejnert C, Degani I, Heckathorn DD. Respondent-driven sampling analysis. RDSAT 6.0.1 Ithaca, NY: Cornell University; 2007.
14.  Salganik MJ. Variance estimation, design effects, and sample size calculations for respondent-driven sampling. J Urban Health 2006; 83 : i98-112.
15.  Burt RD, Thiede H. Evaluating consistency in repeat surveys of injection drug users recruited by respondent-driven sampling in the Seattle area: results from the NHBS-IDU1 and NHBS-IDU2 surveys. Ann Epidemiol 2012; 22 : 354-63.
16.  Taran YS, Johnston LG, Pohorila NB, Saliuk TO. Correlates of HIV risk among injecting drug users in sixteen Ukrainian cities. AIDS Behav 2011; 15 : 65-74.
17.  Gile KJ. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. J Am Stat Assoc 2011; 106 : 135-46.
18.  Heckathorn DD. Extensions of respondent-driven sampling: analyzing continuous variables and controlling for differential recruitment. Sociol Methodol 2007; 37 : 151-207.
19.  Szwarcwald CL, de Souza Júnior PRB, Damacena GN, Junior AB, Kendall C. Analysis of data collected by RDS among sex workers in 10 Brazilian cities, 2009: estimation of the prevalence of HIV, variance, and design effect. J Acquir Immune Defic Syndr 2011; 57 : S129-35.
20.  Salganik MJ. Respondent-driven sampling in the real world. Epidemiology 2012; 23 : 148-50.
21.  Handcock MS, Fellows IE, Gile KJ. Software for the analysis of respondent-driven sampling data. Version 0.42. Los Angeles, CA: Hard to Reach Population Methods Research Group. 2014.
22.  Shah VB. Linearization methods of variance estimation. In: Armitage P, Colton T, editors. Encyclopedia of Biostatistics. New York: John Wiley & Sons, Inc.; 1998. p. 2276-9.
23.  Mahanta J, Medhi GK, Paranjape RS, Roy N, Kohli A, Akoijam BS, et al. Injecting and sexual risk behaviours, sexually transmitted infections and HIV prevalence in injecting drug users in three states in India. AIDS 2008; 22 (Suppl 5) : S59-68.
24.  Chandrasekaran P, Dallabetta G, Loo V, Mills S, Saidel T, Adhikary R, et al. Evaluation design for large-scale HIV prevention programmes: the case of Avahan, the India AIDS initiative. AIDS 2008; 22 (Suppl 5) : S1-15.
25.  Volz E, Wejnert C, Cameron C, Spiller M, Barash V, Degani I, et al. Respondent-driven sampling analysis tool (RDSAT). Version 7.1. Ithaca, NY: Cornell University; 2012.
26.  Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. Statistician 1983; 32 : 307-17.
27.  StataCorp. Stata statistical software. Release 10. 10 ^{th} ed. TX: StataCorp LP; 2007.
28.  Liang KY, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73 : 13-22.
29.  SAS ^{®}. SASadministering SAS^{®} enterprise guide^{®}. 4.3. Cary, NC: SAS Institute Inc.; 2010.
30.  Wejnert C. An empirical test of respondent-driven sampling: point estimates, variance, degree measures, and out-of-equilibrium data. Sociol Methodol 2009; 39 : 73-116.
31.  Heckathorn DD. Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 2002; 49 : 11-34.
32.  Hansen M, Hurwitz W. On the theory of sampling from finite populations. Ann Math Stat 1943; 14 : 333-62.
33.  Binder DA. On the variances of asymptotically normal estimators from complex surveys. Int Stat Rev 1983; 51 : 279-92.
34.  Pfeffermann D. The role of sampling weights when modeling survey data. Int Stat Rev 1993; 61 : 317-37.
35.  Korn EL, Graubard BI. Examples of differing weighted and unweighted estimates from a sample survey. Am Stat 1995; 49 : 291-5.
36.  Pfeffermann D. The use of sampling weights for survey data analysis. Stat Methods Med Res 1996; 5 : 239-61.
37.  Dowd AC, Duggan MB. Computing variances from data with complex sampling designs. Comparison of STATA and SPSS. Boston: North American Stata Users Group. 2001.
38.  Gile KJ, Johnston LG, Salganik MJ. Diagnostics for respondent-driven sampling. J R Stat Soc Ser A Stat Soc 2015; 178 : 241-69.
[Figure 1], [Figure 2]
[Table 1], [Table 2], [Table 3], [Table 4], [Table 5]