Methods

 

 

 

 

Bipolar Disorder


 

Genomewide Linkage Studies

Samples

 

The family samples were all of European American ancestry and came from 8 groups:

 

1) The NIMH Genetics Initiative of Bipolar Disorder (NIMH) (n=581)
2) Johns Hopkins University (JHU) (n=114)
3) Cardiff (n=92)
4) Trinity/Ireland (n=73)
5) University of California, San Diego (UCSD) (n=50)
6) Edinburgh (n=27)
7) NIMH/University of Chicago (n=22)
8) University of California, San Francisco/Utah (n=13)

 

There was a total of 972 families after quality control.  All subjects underwent semi-structured interviews and were interviewed by at least two clinicians who assigned a consensus diagnosis.  Three different affection status models were considered:

1) ASMI: bipolar I disorder and schizoaffective disorder, bipolar sub-type
2) ASMII: ASMI and bipolar II disorder
3) ASMIII: ASMII and recurrent major depression

 

Genotyping

 

Genotyping was done on all samples by Center for Inherited Disease Research using the Illumina Linkage Panel 12 set of 6090 SNPs.

 

Analysis

 

Quality control included the following:

 

1) SNPs with >3% missing data or subjects with >2% missing data were removed
2) SNPs out of Hardy-Weinberg equilibrium at p<0.001 were removed
3) PedCheck was used to check for Mendelian errors, and if errors were found the SNP was set to missing for all members in the pedigree
4) PREST was used to analyze a subset of 4969 SNPs in low linkage disequilibrium (pairwise r2<0.5) for identity by descent between subjects and remove any whose reported relationships were not consistent with their reported relationships
5) Merlin was used to assess for unlikely recombinants and problematic genotypes were set to missing
6) Eigensoft was used to carry out principal component analysis of sample with HapMap samples to identify and remove subjects/families of non-European-American ancestry

 

Parameteric and non-parametric linkage analyses were carried out with Merlin version 1.1.2 using the correction for linkage disequilibrium by identifying clusters of SNPs with pairwise linkage disequilibrium of at least 0.1.  Large pedigrees were trimmed as necessary using PedShrink.

 

Parametric linkage analyses were carried out with two different models:

 

1) A heterogeneity dominant model with risk allele frequency of 0.01 and penetrances of 0.01, 0.70, and 0.70
2) A heterogeneity recessive model with risk allele frequency of 0.1, and penetrances of 0.01, 0.01, and 0.70

 

Non-parametric linkage analyses were carried out under two different models:

 

1) Using SAll under the exponential model
2) Using SPairs under the linear model

 

Genetic maps were obtained from Rutgers Map Interpolator software that estimates genetic distances from the physical locations of the SNPs using the Rutgers combined linkage physical map of the human genome.

 

 

Genomewide Association Studies

Samples

 

A total of 10 case-control samples of European American ancestery were included:

 

1) The BOMA-Bipolar Study (n = 675 cases, 1297 controls)
2) The Genetic Association Information Network (GAIN) / The Bipolar Genome Study (BiGS) (n = 1001 cases, 1032 controls)
3) GlaxoSmithKline (GSK) (n = 892 cases, 902 controls)
4) Pritzker Neuropsychiatric Disorders Research Consortium (NIMH/Pritzker) (n = 1130 cases, 772 controls)
5) Systematic Treatment Enhancement Program for Bipolar Disorder (STEP1) (n = 927 cases, 1468 controls)
6) Thematically Organized Psychoses (TOP) Study (n = 205 cases, 367 controls)
7) Trinity College Dublin (n = 150 cases, 797 controls)
8) University College London (UCL) (n = 490 cases, 495 controls)
9) University of Edinburgh (n = 282 cases, 275 controls)
10) Wellcome Trust Case Control Consortium (n = 1864 cases, 2935 controls)

 

Genotyping

 

Data across the 10 studies were generated with 4 different platforms: Affymetrix 500k, Affymetrix 5.0, Affymetrix 6.0, and Illumina HapMap 500.

 

Analysis

 

SNP names, positions and strand were harmonized. The following quality control filters were then applied:

 

1) Individuals were removed if the missing genotype rate > 0.02 (considering only SNPs with <5% missing data)
2) SNPs were removed if the missing genotype rate >0.02, if the missing genotype rate between cases-controls was different by >0.02, Hardy-Weinberg equilibrium in controls p<1x10-6, or the frequency difference to the HapMap reference was >0.15

 

Duplicate within and across samples were identified by PLINK if estimated probability of genome-wide identity-by-descent of sharing two chromosomes was above 90%.  The duplicates were then removed based on the goal of preserving the case:control ratio in each study as close to 50:50 as possible and favoring data generated using more recent platforms. 

Data were imputed using BEAGLE 3.0 with phased HapMap phase 2 data as reference.  Each dataset was imputed separately, with each randomly split into imputation batches of 300 individuals, keeping case-control ratio balanced

Multi-dimensional scaling with SNPs genotyped in common across all studies and in relative linkage equilibrium was carried out to assess for population stratification and to derive quantitative indices of ancestry to control for in downstream analyses.

After all QC, there were 16,731 individuals (7481 cases and 9250 controls) and 2,541,952 SNPs.  Analyses were done with 2,415,422 SNPs with MAF>0.01 and imputation r2>0.3.  The primary analysis was with logistic regression on single SNP allelic dosages including dummy-coded covariates for each of the 10 sites and the top 5 quantitative indices of ancestry from the MDS analysis.  To adjust for the relatedness between the siblings, a robust Huber-White sandwich variance estimator for cluster-correlated observations was used. The genomic inflation factor (λ) was calculated as the ratio between the observed and expected median χ2 statistics, was 1.148 and was used to correct for the degree of inflation.

 

 

Genomewide Expression Studies

Literature search and data collection

We exhaustively searched the PubMed database for genome-wide gene expression array studies in BP. We also searched the references of identified studies for thoroughness. In addition to published studies we also queried public microarray repositories including Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) and Array Express (http://www.ebi.ac.uk/arrayexpress/). Studies were included if they reported on a case-control genome-wide gene expression array study in BP in humans. Raw gene expression array data was obtained by scanning the literature to identify GEO accession identifiers (ID) or links for downloadable feature-level extraction output (FLEO) files such as CEL files. If the main text did not contain an accession ID or a link to any FLEO files, we searched existing repositories and the research group’s laboratory web pages. If unsuccessful, we wrote to the authors. If multiple publications used overlapping data, we identified the most comprehensive dataset available.

 

Data processing

In order to consistently handle all datasets and eliminate bias introduced by relying on different algorithms used in the original studies, we obtained the raw data and converted these into analysis-ready gene expression data matrices (GEDM) by processing each study individually using a single analysis pipeline outlined in five steps below:

Step 1: Normalization and background correction using Frozen Robust Multi-array Analysis (fRMA).
Step 2: Removing outliers using Principal Component Analysis (PCA).
Step 3: Filtering probesets using Presence Absence Calls with Negative Probesets (PANP).
Step 4: Removing batch effects using Surrogate Variable Analysis (SVA).
Step 5: Mapping probesets to genes using JetSet.

 

Data analysis

We combined the data for each study into a single large matrix and conducted a mega-analysis. We refer to this as a mega-analysis because the individual level data from each study were analyzed together instead of having been done separately by study and then having been summarized across studies as in a meta-analysis. We conducted three separate mega-analyses on samples from 1) any region of the brain (9 studies); 2) the prefrontal cortex (PFC) (6 studies); and 3) the
hippocampus (2 studies). To minimize heterogeneity across studies, we focused primarily on the most numerous, recent and comprehensive studies. The mega-analysis approach allowed us to more efficiently address the overlap in samples by using mixed effects linear regression with crossed random effects for study and subject to account for both within study and within subject correlations. We used the lmer function from the lme4 package in R with default parameters to fit the mixed-effects models. The primary fixed effect of interested was a dichotomous variable for case-control status. Since the SVA was carried out to address measured and unmeasured confounding, we did not include other fixed effects covariates in the models. We fit separate models for each gene. Summary fold changes (FC) by case-control status, standard errors, 95% confidence intervals, p-values and false discovery rate qvalues for each gene were stored as output from each of the models. We used the program DAVID to determine if there was an enrichment of common pathway annotations among the significant differentially expressed genes in the three mega-analyses.

 

Results

A total of 30 genome-wide gene expression studies of BP done with blood or brain tissue were identified. We included 10 studies with data on 211 microarrays on 57 unique BP cases and 229 microarrays on 60 unique controls in the quantitative mega-analysis. A total of 382 genes were identified as significantly differentially expressed by the three analyses. Eleven genes survived correction for multiple testing with a q-value < 0.05 in the PFC. Among these were FKBP5 and WFS1, which have been previously implicated in mood disorders. Pathway analyses suggested a role for metallothionein proteins, MAP Kinase phosphotases, and neuropeptides.

 

 

 

Candidate Gene Association Studies

Literature Review

All published candidate gene association studies of the selected gene were identified by an exhaustive search of the PubMed database plus a snowball search of references of identified studies. 

 

Studies were included if they:
1) Reported on a candidate gene association study (case-control or family-based) in adult humans,
2) Examined individual level genotype data with an appropriate set of controls,
3) Tested for association with a dichotomous phenotype of bipolar disorder that included bipolar I disorder (BPI), bipolar II disorder (BPII), and/or schizoaffective disorder bipolar sub-type (SABP),
4) Published in a peer-review journal in English.

 

Studies were excluded if they:
1) Did not distinguish between types of mood disorders,
2) Examined drug response or other quantitative traits,
3) Examined specific clinical features and/or co-morbidities within mood disorders,
4) Did not report on primary data,
5) Did not provide any characterization of the study sample

 

Meta-analysis

A quantitative meta-analysis was carried out for each polymorphism that was examined in three or more case-control studies.  Key data from each of these studies were extracted into evidence tables, including:

 

1) PubMed ID, author, journal and year of publication;
2) study design;
3) sample information including country of origin, ethnicity, source of ascertainment and subject counts;
4) diagnoses, diagnostic instrument and diagnostic criteria used;
5) polymorphism information including dbSNP reference SNP identification number (rsID), base pair (BP) location, major/minor allele, and genotype/allele counts for cases and controls or transmitted/untransmitted alleles for families;
6) sample meta-information including percentage of males for cases and controls, age at interview and age at onset.

 

To assure the data for each polymorphism was recorded with a common orientation across all studies, we used the HapMap(International HapMap Consortium, 2003) - release 24, Phase I and II CEPH (Centre d'Etude du Polymorphisme Humain, Utah residents with ancestry from northern and western Europe) to designate the strand orientation and major/minor alleles.  If  a polymorphism was not genotyped in the HapMap CEPH samples, we used the SNP database of the National Center of Biotechnology Information (NCBI) dbSNP version 130 with human genome assembly build 36 and the University of California Santa Cruz (UCSC) Genome Browser Database to determine a common orientation. Many studies did not provide rsIDs, but instead reported codes based on nucleotide or amino acid position.  To retrieve rsIDs for such polymorphisms we again used the dbSNP database.  If a polymorphism still could not be mapped to an rsID we used the UCSC’s genome browser In-Silico PCR primer-BLAT to map the polymorphisms using the provided primer sequences.  When only allele counts were provided, we calculated the corresponding genotype counts assuming Hardy-Weinberg Equilibrium.  If the polymorphism was not in Hardy-Weinberg equilibrium in the controls, we kept the genotype counts as missing.

 

Allelic odds ratios (OR), standard errors (SE), 95% confidence intervals (CI), p-values (P) and HWE for each polymorphism were calculated individually for each study.  The major allele (as designated by the HapMap - release 24, Phase I and II CEPH) for the polymorphisms was used as the reference to calculate the ORs.  For polymorphisms with multiple alleles we used the most common allele as the reference allele and calculated separate OR’s for all other alleles relative to this.  We excluded from the meta-analysis studies with genotype data that was not in HWE in controls. For each polymorphism, we calculated Woolf's chi-squared to assess for between-study heterogeneity. Allelic summary OR and 95% CI were estimated under the DerSimonian-Laird random-effects model using inverse-variance weights in R statistical programming.  The significance of the summary OR was determined using an asymptotic Z-test. 

 

 

Major Depresion


Genomewide Linkage Studies

Samples

 

Families were recruited at 6 sites by opportunistic means:*
1) Columbia University (n=133)
2) University of Pittsburgh (n=119)
3) University of Pennsylvania (n=115)
4) Rush University Hospital (n=113)
5) Johns Hopkins University (n=98)
6) University of Iowa (n=98)

A total of 676 families were ascertained, of which 656 were included in the analysis. All subjects were assessed by telephone or in person using the Diagnostic Interview for Genetic Studies (DIGS 3.0) [PMID: 7944874] and the Family Interview for Genetic Studies (FIGS). Two research clinicians reviewed the clinical data and assigned independent and consensus diagnoses. Probands had to have at least two lifetime major depressive episodes lasting past age 18 or one episode lasting 3 or more years and with an age at onset before 31. All family members who met the same criteria, except with age of onset before 41, were also recruited.

 

Genotyping

Genotyping was carried out by the Center for Inherited Disease. A total of 418 microsatellite markers were genotyped at mean 9 cM spacing and mean marker heterozygosity of 0.77.

 

Analysis

Quality control included the following:

1) Pedigree structure and sample swap errors were detected by PEDCHECK [PMID: 9634505] using 15 markers and with RELCHECK [PMID: 9311748] using all makers.
2) Genotypes were excluded if they showed Mendelian inconsistencies or if SIBMED [PMID: 10739757] estimated 70% probability of error
* There were 3.29% missing genotypes, with 0.04% genotypewise errors based on duplicated genotypes and 0.32% detected Mendelian inconsistencies.

 

Linkage analysis:

Multipoint allele-sharing analysis was carried using ALLEGRO [PMID: 10802644]. The genetic map was based on the deCODE map [PMID: 12053178] or derived by interpolation from Marshfield locations. Linkage was calculated using both the linear and exponential models with the Spairs scoring function and equal weighting for all families (note that this weighting scheme is different than what was used in the manuscript and may account for the slightly different results). We report the LOD, NPL and Zlr scores.

 

 

Genomewide Association Studies

Samples

 

A total of 9 case-control samples of European American ancestery were included:

 

1) The Genetic Association Information Network (GAIN) (n = 1696 cases, 1765 controls)
2) The Genetics of Re-current Early Onset Depression (GenRED) (n = 1030 cases, 1253 controls)
3) GlaxoSmithKline (GSK) (n = 887 cases, 864 controls)
4) MDD2000-QIMR_610 (n = 433 cases, 751 controls)
5) MDD2000-QIMR_317 (n = 1017 cases, 960 controls)
6) MPIP (n = 376 cases, 537 controls)
7) RADIANT+Bonn/Mannheim (n = 935 cases, 1290 controls)
8) RADIANT (n = 1625 cases, 1588 controls)
9) STAR*D (n = 1241 cases, 511 controls)

 

Cases were required to have a DSM-IV diagnosis of lifetime Major Depression established using structured diagnostic instruments from direct interview by trained interviewers or clinician administered DSM-IV checklists.  Two studies required recurrent Major Depression and one, re-current early-onset Major Depression.

 

Genotyping

Data across the 10 studies were generated with 4 different platforms: Affymetrix 500k, Affymetrix 5.0, Affymetrix 6.0, and Illumina HapMap 500.

 

Analysis

SNP names, positions and strand were harmonized. The following quality control filters were then applied:

 

1) Individuals were removed if the missing genotype rate > 0.02 (considering only SNPs with <5% missing data)
2) SNPs were removed if the missing genotype rate >0.02, if the missing genotype rate between cases-controls was different by >0.02, Hardy-Weinberg equilibrium in controls p<1x10-6, or the frequency difference to the HapMap reference was >0.15

 

The relatedness of all pairs of individuals were determined by analysis of the SNPs present on all platforms, and duplicate or closely related individuals were removed to apportion the overlapping samples.  

 

Data were imputed using BEAGLE 3.0 with phased CEU+TSI HapMap3 data as reference.  Each dataset was imputed separately, with each randomly split into imputation batches of 300 individuals, keeping case-control ratio balanced.

 

Multi-dimensional scaling with SNPs genotyped in common across all studies and in relative linkage equilibrium was carried out to assess for population stratification and to derive quantitative indices of ancestry to control for in downstream analyses.

 

After all QC, there were 18,759 individuals (9240 cases and 9519 controls) and 1,235,109 SNPs.  Analyses were done with 2,415,422 SNPs with MAF>0.01 and imputation r2>0.3.  The primary analysis was with logistic regression on single SNP allelic dosages including dummy-coded covariates for study sites and the top 5 quantitative indices of ancestry from the MDS analysis.  The genomic inflation factor (λ) calculated as the ratio between the observed and expected median χ2 statistics was 1.056.

 

Chromosome X SNPs were imputed with data from the 1000 Genomes Project [Durbin et al, 2010].  SNPs with missingness >0.05 or HWE p<1x10-6 were excluded.  Phasing was carried out with MACH [ref] for females.  Imputation was performed separately in males and females using MINIMACH with haplotypes from 351 European samples from the 1000 Genomes Project.  Chromsome X SNPs in HapMap2 or HapMap3 were retained for analysis.  Association tests were carried with logistic regression under an additive model using the same covariates as the autosomal analysis with results for males and females meta-analyzed.

 

 

Genomewide Expression Studies

Literature search and data collection

We exhaustively searched the PubMed database for genome-wide gene expression array studies in MD. We also searched the references of identified studies for thoroughness. In addition to published studies we also queried public microarray repositories including Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) and Array Express (http://www.ebi.ac.uk/arrayexpress/). Studies were included if they reported on a case-control genome-wide gene expression array study in BP in humans. Raw gene expression array data was obtained by scanning the literature to identify GEO accession identifiers (ID) or links for downloadable feature-level extraction output (FLEO) files such as CEL files. If the main text did not contain an accession ID or a link to any FLEO files, we searched existing repositories and the research group’s laboratory web pages. If unsuccessful, we wrote to the authors. If multiple publications used overlapping data, we identified the most comprehensive dataset available.

 

Data processing

In order to consistently handle all datasets and eliminate bias introduced by relying on different algorithms used in the original studies, we obtained the raw data and converted these into analysis-ready gene expression data matrices (GEDM) by processing each study individually using a single analysis pipeline outlined in five steps below:

Step 1: Normalization and background correction using Frozen Robust Multi-array Analysis (fRMA).
Step 2: Removing outliers using Principal Component Analysis (PCA).
Step 3: Filtering probesets using Presence Absence Calls with Negative Probesets (PANP).
Step 4: Removing batch effects using Surrogate Variable Analysis (SVA).
Step 5: Mapping probesets to genes using JetSet.