Usage Statistics

No Analysis Pipeline Name Description Usage Count
1 POSTECH_EPIGENOME_SEQUENCING_FASTQC_BOWTIE_MACS_PIPING The analysis performed at each step is as follows. The Quality control step checks the sequencing quality of the input data. The Quality filter step removes low-quality reads from the data. The Alignment step maps the data against a reference sequence. The Cross correlation step performs quality control on the mapping result. The Peak calling step searches for significant regions (peaks), using MACS. The Annotation step attaches detailed descriptions to the regions found in the previous step. The Visualization step visualizes the mapping data and the peak data. 142
2 RNASeq_TOPHAT2_CUFFLINKS_PIPELINE This pipeline analyzes and processes RNA-Seq samples; it then assembles transcripts, estimates their abundances, and tests for differential expression and regulation using Cufflinks. 86
3 RNASeq_STAR_RSEM_PIPELINE This pipeline is an RNA sequencing pipeline that aligns reads with the STAR program and performs quantification with RSEM. 53
4 RNASeq_STAR_HTSEQ_PIPELINE This pipeline is an RNA sequencing pipeline that aligns reads with the STAR program and performs quantification with HTSeq. 51
5 POSTECH_BROAD_SOURCE_CHIP_SEQ_FASTQC_BWA_MACS2_PIPING The analysis performed at each step is as follows. The Quality control step checks the sequencing quality of the input data. The Quality filter step removes low-quality reads from the data. The Alignment step maps the data against a reference sequence and then checks the mapping quality and duplication level of the mapped data. The Visualization step visualizes the mapping data and the peak data. The Peak calling step searches for significant regions (peaks or domains) using RSEG/SICER/hiddenDomains/BCP/MACS2, which are specialized for broad-source factors. The Annotation step attaches detailed descriptions to the regions found in the previous step. 45
6 RNASeq_KALLISTO_PIPELINE This pipeline is an RNA sequencing pipeline that performs pseudo-alignment and quantification quickly using the Kallisto program. 40
7 RNASeq_EMSAR_PPIPELINE This pipeline analyzes RNA-Seq data to obtain isoform-level estimates with EMSAR, and then produces gene-level expression estimates from those isoform-level estimates. 27
8 RNASeq_STARFUSION_PIPELINE The fusion detection pipeline for RNA-Seq using STAR-Fusion consists of two stages: Quality Check, and Alignment & Fusion Prediction. The analysis performed at each stage is as follows. The first stage, Quality Check, checks the sequencing quality of the input data with FastQC. Before moving on to the Alignment & Fusion Prediction stage, the reference files (reference genome FASTA file, transcriptome annotation file, BLAST matching gene-pair file, fusion annotation file) are indexed to build a reference index. Mapping is then performed against the resulting library index, and the fusion prediction step finally produces a fusion_prediction.tsv file. Several options can be used to include annotations in the result, and the fusion_prediction.tsv file is used for downstream analyses. 27
9 POSTECH_INFINIUM450K_RNBEADS_PIPING The analysis performed at each step is as follows. First, the Infinium450K microarray data are converted into an RnBSet object suitable for RnBeads analysis. The Quality control step checks the quality of the input data, filters out unsuitable data such as SNP-enriched sites, high-coverage outlier sites, low-coverage sites, and sex chromosomes, and then performs normalization. The Exploratory analysis step performs various global-level analyses such as methylation-level profiling by genomic element, Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and clustering. The Differential methylation analysis step computes the methylation relationships between samples, shows the sample clustering result, and reports its statistical significance. The Annotation step provides information such as chromosome site, color, context, GC content, and SNP count. The Visualization step outputs the methylation data not only in bed format but also in bigbed and bigwig formats for use with other track hubs. The above analysis steps are run as a single integrated process using RnBeads. 23
10 RNASeq_RSEM_VOOM_PIPELINE The pipeline consists of six modules: Quality control, Adaptive trimming, Alignment, Filter reads, Quantification, and Differential expression. The analysis performed at each step is as follows. The first step, Quality control, checks the sequencing quality of the input data with FastQC. The Adaptive trimming step then uses Sickle to remove low-quality reads and adapters from the input data and matches the R1/R2 pairs to obtain the common sequences. These common R1 and R2 sequences are used as input to the Alignment step, which builds a reference index with Bowtie1 and maps the reads with MapSplice2. The Filter reads step takes the mapped data as input, sorts the mapped BAM file with Picard, sorts it again by genomic location with SAMtools, and indexes it to improve performance. A Perl script then re-sorts the file into the same chromosome order as the reference, a Java program annotates the transcriptome, and reads with large indels or inserts, or with poor mapping, are removed. The resulting BAM file is quantified with RSEM to count reads; this yields FPKM, TPM, and read-count values. The final Differential expression step uses the R package limma (voom) to compare the expression levels of gene transcripts and obtain differentially expressed genes (DEGs). 21
11 resequencing_pipeline resequencing pipeline 13
12 COLLECTIVE_GENOME_PCA_KIMURA_PIPING The PCA and phylogenetic tree pipeline using the Kimura two-parameter distance runs in five steps. After creating the pipeline's output directory, first the VCF file is converted to Plink format; second, the PED file among the converted Plink files is converted to a FASTA-format file; third, this FASTA file is used to build a pairwise matrix of Kimura two-parameter distances over all samples; fourth, the pairwise matrix is used to draw the PCA plot and the scree plot; and finally, the pairwise matrix is used again to draw the phylogenetic tree and, additionally, to generate a Newick-format file that can be used with tools such as MEGA7. 11
13 POSTECH_BROAD_SOURCE_CHIP_SEQ_FASTQC_BWA_MACS2_PIPELINE The analysis performed at each step is as follows. The Quality control step checks the sequencing quality of the input data. The Quality filter step removes low-quality reads from the data. The Alignment step maps the data against a reference sequence and then checks the mapping quality and duplication level of the mapped data. The Visualization step visualizes the mapping data and the peak data. The Peak calling step searches for significant regions (peaks or domains) using RSEG/SICER/hiddenDomains/BCP/MACS2, which are specialized for broad-source factors. The Annotation step attaches detailed descriptions to the regions found in the previous step. 11
14 RNASeq_TOPHAT2_CUFFLINKS_PIPING This pipeline analyzes and processes RNA-Seq samples; it then assembles transcripts, estimates their abundances, and tests for differential expression and regulation using Cufflinks. 10
15 POSTECH_EPIGENOME_SEQUENCING_FASTQC_BOWTIE_MACS_PIPELINE The analysis performed at each step is as follows. The Quality control step checks the sequencing quality of the input data. The Quality filter step removes low-quality reads from the data. The Alignment step maps the data against a reference sequence. The Cross correlation step performs quality control on the mapping result. The Peak calling step searches for significant regions (peaks), using MACS. The Annotation step attaches detailed descriptions to the regions found in the previous step. The Visualization step visualizes the mapping data and the peak data. 8
16 RNASeq_STAR_HTSEQ_PIPING This pipeline is an RNA sequencing pipeline that aligns reads with the STAR program and performs quantification with HTSeq. 7
17 POSTECH_INFINIUM450K_MICROARRAY_ANALYSIS_RNBEADS_PIPELINE The analysis performed at each step is as follows. First, the Infinium450K microarray data are converted into an RnBSet object suitable for RnBeads analysis. The Quality control step checks the quality of the input data, filters out unsuitable data such as SNP-enriched sites, high-coverage outlier sites, low-coverage sites, and sex chromosomes, and then performs normalization. The Exploratory analysis step performs various global-level analyses such as methylation-level profiling by genomic element, Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and clustering. The Differential methylation analysis step computes the methylation relationships between samples, shows the sample clustering result, and reports its statistical significance. The Annotation step provides information such as chromosome site, color, context, GC content, and SNP count. The Visualization step outputs the methylation data not only in bed format but also in bigbed and bigwig formats for use with other track hubs. The above analysis steps are run as a single integrated process using RnBeads. 7
18 RNASeq_STAR_RSEM_PIPING This pipeline is an RNA sequencing pipeline that aligns reads with the STAR program and performs quantification with RSEM. 6
19 RNASeq_STARFUSION_PIPING The fusion detection pipeline for RNA-Seq using STAR-Fusion consists of two stages: Quality Check, and Alignment & Fusion Prediction. The analysis performed at each stage is as follows. The first stage, Quality Check, checks the sequencing quality of the input data with FastQC. Before moving on to the Alignment & Fusion Prediction stage, the reference files (reference genome FASTA file, transcriptome annotation file, BLAST matching gene-pair file, fusion annotation file) are indexed to build a reference index. Mapping is then performed against the resulting library index, and the fusion prediction step finally produces a fusion_prediction.tsv file. Several options can be used to include annotations in the result, and the fusion_prediction.tsv file is used for downstream analyses. 2
20 COLLECTIVE_GENOME_BETWEEN_GROUPS_PI_CALCULATION_PIPING The pipeline that computes and visualizes the nucleotide diversity (Pi) of each WG group using VCFtools runs in seven steps. After creating the pipeline's output directory, first the Pi values are computed from each group's VCF file for the desired window size and step; second, basic statistics are produced for the per-group Pi values; third, the computed Pi values of each group are checked for anomalies so that they can be visualized together; fourth, the per-group Pi values are converted into files for visualization; and fifth, the converted per-group files are merged into a single file, the final input for visualization. Finally, the final input is used to visualize the Pi values for each group. 1
21 COLLECTIVE_GENOME_LD_DECAY_CALCULATION_PIPING The pipeline that uses the PopLDdecay software to compute and visualize the average LD decay of each WG group as a function of distance runs in five steps. After creating the pipeline's output directory, first the average LD decay of each group is computed from that group's VCF file over the desired distance using PopLDdecay; second, the per-group LD decay values are converted into files for visualization; and third, the converted per-group files are merged into a single file, the final input for visualization. Finally, the final input is used to visualize the average LD decay over distance for each group. 1
22 RNASeq_KALLISTO_PIPING This pipeline is an RNA sequencing pipeline that performs pseudo-alignment and quantification quickly using the Kallisto program. 1
23 RNASeq_EMSAR_PIPING This pipeline analyzes RNA-Seq data to obtain isoform-level estimates with EMSAR, and then produces gene-level expression estimates from those isoform-level estimates. 1
24 RNASeq_RSEM_VOOM_PIPING The pipeline consists of six modules: Quality control, Adaptive trimming, Alignment, Filter reads, Quantification, and Differential expression. The analysis performed at each step is as follows. The first step, Quality control, checks the sequencing quality of the input data with FastQC. The Adaptive trimming step then uses Sickle to remove low-quality reads and adapters from the input data and matches the R1/R2 pairs to obtain the common sequences. These common R1 and R2 sequences are used as input to the Alignment step, which builds a reference index with Bowtie1 and maps the reads with MapSplice2. The Filter reads step takes the mapped data as input, sorts the mapped BAM file with Picard, sorts it again by genomic location with SAMtools, and indexes it to improve performance. A Perl script then re-sorts the file into the same chromosome order as the reference, a Java program annotates the transcriptome, and reads with large indels or inserts, or with poor mapping, are removed. The resulting BAM file is quantified with RSEM to count reads; this yields FPKM, TPM, and read-count values. The final Differential expression step uses the R package limma (voom) to compare the expression levels of gene transcripts and obtain differentially expressed genes (DEGs). 1
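Several of the RNA-Seq pipelines above (e.g. RNASeq_RSEM_VOOM_PIPELINE) report FPKM, TPM, and raw read counts from the quantification step. As a minimal sketch of what the two normalizations mean — using hypothetical counts and transcript lengths, not RSEM's actual expectation-maximization procedure — they can be computed as:

```python
# FPKM and TPM from raw read counts and transcript lengths.
# Values are illustrative; RSEM additionally resolves multi-mapping
# reads via expectation-maximization before producing these numbers.

def fpkm(counts, lengths_bp):
    total = sum(counts)  # total mapped reads in the library
    return [c / (L / 1e3) / (total / 1e6) for c, L in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    rpk = [c / (L / 1e3) for c, L in zip(counts, lengths_bp)]  # reads per kilobase
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

counts = [100, 300, 600]      # hypothetical read counts per transcript
lengths = [1000, 2000, 3000]  # transcript lengths in bp

print(tpm(counts, lengths))   # TPM values always sum to one million
```

Because TPM sums to one million across transcripts in every sample, it is often preferred over FPKM when comparing relative composition between samples.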
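The Kimura two-parameter distance used by COLLECTIVE_GENOME_PCA_KIMURA_PIPING corrects observed differences for multiple substitutions, treating transitions and transversions separately: d = -½·ln(1 - 2P - Q) - ¼·ln(1 - 2Q), where P and Q are the proportions of transition and transversion sites. A self-contained sketch of the per-pair computation (the pipeline derives its sequences from the Plink-to-FASTA conversion; the sequences below are hypothetical):

```python
import math

# Kimura two-parameter (K2P) distance between two aligned sequences.
# P = proportion of transitions (A<->G, C<->T); Q = proportion of transversions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def k2p_distance(seq1, seq2):
    assert len(seq1) == len(seq2), "sequences must be aligned"
    n = len(seq1)
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        if frozenset((a, b)) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    p, q = transitions / n, transversions / n
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

# one A->G transition in four sites
print(k2p_distance("AAAA", "GAAA"))
```

Building the pairwise matrix the pipeline describes is then just evaluating this function over all sample pairs.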
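COLLECTIVE_GENOME_BETWEEN_GROUPS_PI_CALCULATION_PIPING computes nucleotide diversity (Pi) per window with VCFtools. Stripped of VCF parsing and windowing, the core quantity is the average proportion of differing sites over all pairs of sequences in a group; a sketch with hypothetical haplotypes:

```python
from itertools import combinations

# Nucleotide diversity (pi): the average number of pairwise differences
# per site across all pairs of sequences in the group. VCFtools computes
# this per window from a VCF; here the haplotypes are given directly.

def nucleotide_diversity(seqs):
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * n_sites)

group = ["ACGTACGT", "ACGTACGA", "ACCTACGT"]  # hypothetical haplotypes
print(nucleotide_diversity(group))  # average pairwise differences per site
```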
No Analysis Program Description Type Usage Count
1 bedtools_genomecov_bam Used when one wants to measure the genome-wide coverage of a feature file (measures genome coverage from a BAM file). LINUX 18
2 big_bwa_mem Uses Hadoop to boost the performance of the Burrows-Wheeler Aligner (BWA), which works by seeding alignments with maximal exact matches (MEMs) and then extending the seeds with the affine-gap Smith-Waterman algorithm (SW) (a Hadoop-based process that aligns reads by extending seeds with BWA's affine-gap Smith-Waterman algorithm). HADOOP 4
3 bowtie2_build bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. LINUX 41
4 bowtie_build bowtie-build builds a Bowtie index from a set of DNA sequences. bowtie-build outputs a set of 6 files with suffixes .1.ebwt, .2.ebwt, .3.ebwt, .4.ebwt, .rev.1.ebwt, and .rev.2.ebwt. (If the total length of all the input sequences is greater than about 4 billion, then the index files will end in .ebwtl instead of .ebwt.) LINUX 42
5 bowtie_pe Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour (produces a SAM file). LINUX 14
6 bowtie_se Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour (produces a SAM file). LINUX 27
7 bwa_aln Find the SA coordinates of the input reads. Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and maximum maxDiff differences are allowed in the whole sequence. LINUX 16
8 bwa_aln_sampe Find the SA coordinates of the input reads. Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and maximum maxDiff differences are allowed in the whole sequence. Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly. LINUX 1
9 bwa_index Index database sequences in the FASTA format LINUX 51
10 bwa_mem Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW) LINUX 48
11 bwa_sampe Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly. LINUX 16
12 bwa_samse Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen. LINUX 16
13 clustalo Clustal Omega is a general-purpose multiple sequence alignment program. LINUX 14
14 cmpfastq A simple perl program that allows the user to compare QC filtered fastq files LINUX 16
15 cuffdiff Comparing expression levels of genes and transcripts in RNA-Seq experiments is a hard problem. Cuffdiff is a highly accurate tool for performing these comparisons, and can tell you not only which genes are up- or down-regulated between two or more conditions, but also which genes are differentially spliced or are undergoing other types of isoform-level regulation LINUX 45
16 cufflinks Cufflinks is both the name of a suite of tools and a program within that suite. Cufflinks the program assembles transcriptomes from RNA-Seq data and quantifies their expression LINUX 44
17 cuffmerge Transcriptome assembly and differential expression analysis for RNA-Seq. LINUX 40
18 cutadapt Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. LINUX 16
19 decompress This program decompresses compressed files (tar.gz, tar.bz2, tar.xz, tar, gz, bz2, xz, zip). LINUX 3
20 emsar_build_pe Uses the Transcriptome.fa file to build an index file (.rsh) that makes subsequent computation efficient. LINUX 10
21 emsar_postprocessing Computes the TPM values for a sample using the 'gfpkm', 'gene read count', and 'gene' values. LINUX 10
22 emsar_transcript_stat Builds a stats file describing transcript GC content, isoform information, and so on. LINUX 10
23 fastqc FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis LINUX 112
24 fastx_fastq_quality_filter Filters sequences based on quality (filters out low-quality sequences). LINUX 41
25 fastx_fastx_artifacts_filter Removes artifacts from the sequence reads. LINUX 16
26 fq_split Splits a FASTQ file. LINUX 1
27 gatk_analyzecovariates This tool generates plots for visualizing the quality of a recalibration run. LINUX 18
28 gatk_analyzecovariates_single Evaluate and compare base quality score recalibration tables. This tool generates plots to assess the quality of a recalibration run as part of the Base Quality Score Recalibration (BQSR) procedure. Summary of the BQSR procedure: the goal of this procedure is to correct for systematic bias that affects the assignment of base quality scores by the sequencer. The first pass consists of calculating error empirically and finding patterns in how error varies with basecall features over all bases. The relevant observations are written to a recalibration table. The second pass consists of applying numerical corrections to each individual basecall based on the patterns identified in the first step (recorded in the recalibration table) and writing out the recalibrated data to a new BAM or CRAM file. (with a single file) LINUX 29
29 gatk_baserecalibrator First pass of the base quality score recalibration. Generates a recalibration table based on various covariates. The default covariates are read group, reported quality score, machine cycle, and nucleotide context. This walker generates tables based on specified covariates. It does a by-locus traversal operating only at sites that are in the known sites VCF. ExAc, gnomAD, or dbSNP resources can be used as known sites of variation. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. Since there is a large amount of data one can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, num observations, num mismatches, empirical quality score) LINUX 46
30 gatk_baserecalibrator_two_databases The Genome Analysis Toolkit (GATK) is a software package for analysis of high-throughput sequencing data, developed by the Data Science and Data Engineering group at the Broad Institute (transforms and analyzes sequencing data using two databases). LINUX 18
31 gatk_depthofcoverage Assess sequence coverage by a wide array of metrics, partitioned by sample, read group, or library. LINUX 16
32 gatk_haplotypecaller Call germline SNPs and indels via local re-assembly of haplotypes The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper. In the GVCF workflow used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way. The GVCF workflow enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes LINUX 19
33 gatk_indelrealigner The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect to the true indel, also requiring realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches. LINUX 64
34 gatk_printreads Write reads from a SAM format file (SAM/BAM/CRAM) that pass criteria to a new file. A common use case is to subset reads by genomic interval using the -L argument. Note that when applying genomic intervals, the tool is literal and does not retain mates of paired-end reads outside of the interval, if any. Data with missing mates will fail ValidateSamFile validation with MATE_NOT_FOUND, but certain tools may still analyze the data. If needed, to rescue such mates, use either FilterSamReads or ExtractOriginalAlignmentRecordsByNameSpark. By default, PrintReads applies the WellformedReadFilter at the engine level. What this means is that the tool does not print reads that fail the WellformedReadFilter. You can similarly apply other engine-level filters to remove specific types of reads with the --read-filter argument. See documentation category 'Read Filters' for a list of available filters. To keep reads that do not pass the WellformedReadFilter, either disable the filter with --disable-read-filter or disable all default filters with --disable-tool-default-read-filters. The reference is strictly required when handling CRAM files. LINUX 63
35 gatk_realignertargetcreator Define intervals to target for local realignment LINUX 64
36 gatk_selectvariants Select a subset of variants from a VCF file. This tool makes it possible to select a subset of variants based on various criteria in order to facilitate certain analyses. Examples of such analyses include comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or troubleshooting some unexpected results, to name a few. LINUX 18
37 gatk_variantfiltration Filter variant calls based on INFO and/or FORMAT annotations This tool is designed for hard-filtering variant calls based on certain criteria. Records are hard-filtered by changing the value in the FILTER field to something other than PASS. Filtered records will be preserved in the output unless their removal is requested in the command line LINUX 18
38 gsa_seq At Ambry, Sanger gene sequencing is performed on specific regions of DNA that have been amplified by polymerase chain reaction (PCR). Double stranded sequencing occurs in both sense and antisense directions to detect sequence variations. For Specific Site Analysis, specific region(s) of DNA is/are amplified by PCR and sequenced. Sanger sequencing is performed for any regions missing or with insufficient read depth coverage for reliable heterozygous variant detection. Suspect variant calls other than "likely benign" or "benign" are verified by Sanger sequencing (a process that compares expression levels between experimental and control groups in RNA-sequencing data to reveal biological functions that differ). LINUX 1
39 hadoop_bam_cat Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools. Cat concatenates partial SAM and BAM files. HADOOP 1
40 hadoop_bam_fixmate Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools. Fixmate fixes mate information in BAM and SAM files. HADOOP 1
41 hadoop_bam_index Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools. Index indexes a BAM file. HADOOP 1
42 hadoop_bam_sort Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools. Sort sorts and merges BAM or SAM files. HADOOP 1
43 hadoop_blastp An algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences, based on Hadoop. HADOOP 1
44 homer_annotatepeaks All-in-one program for performing peak annotation (checks the genomic features of the peaks produced by peak calling). LINUX 25
45 homer_maketagdirectory To facilitate the analysis of ChIP-Seq (or any other type of short read re-sequencing data), it is useful to first transform the sequence alignment into a platform-independent data structure representing the experiment, analogous to loading the data into a database (shows the ChIP-fragment density at every position in the genome). LINUX 25
46 homer_makeucscfile The UCSC Genome Browser is quite possibly one of the best computational tools ever developed. Not only does it contain an incredible amount of data in a single application, it allows users to upload custom information such as data from their ChIP-Seq experiments so that they can be easily visualized and compared to other information (creates a bedGraph-format file). LINUX 25
47 interproscan Users who have novel nucleotide or protein sequences that they wish to functionally characterise can use the software package InterProScan to run the scanning algorithms from the InterPro database in an integrated way. Sequences are submitted in FASTA format. Matches are then calculated against all of the required member databases' signatures and the results are then output in a variety of formats. LINUX 23
48 last LAST_lastdb, LAST_lastal, LAST_split, LAST_maf-swap, LAST_maf-convert LINUX 26
49 macs_pe Model-based Analysis of ChIP-Seq data (MACS) analyzes data generated by short read sequencers such as Solexa's Genome Analyzer (finds histone-enriched regions). LINUX 25
50 make_g2b Sorts a GTF file in chromosome order and then converts the GTF file to a BED file. LINUX 21
51 make_g2t Creates a g2t file from gene IDs and their matching transcript IDs. LINUX 10
52 make_matrix_count Builds a count matrix for limma-voom (accepts 10 or fewer isoforms.result files). LINUX 21
53 mapsplice2 MapSplice is software for mapping RNA-seq data to a reference genome for splice junction discovery; it depends only on the reference genome, not on any further annotations. LINUX 21
54 muscle MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests (fast and accurate multiple sequence alignment). LINUX 3
55 picard_addorreplacereadgroups Replace read groups in a BAM file. This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file. LINUX 85
56 picard_buildbamindex Generates a BAM index ".bai" file. This tool creates an index file for the input BAM that allows fast look-up of data in a BAM file, like an index on a database. LINUX 18
57 picard_collecinsertsizemetrics This tool provides useful metrics for validating library construction, including the insert size distribution and read orientation of paired-end libraries. LINUX 34
58 picard_collectalignmentsummarymetrics Produces a summary of alignment metrics from a SAM or BAM file. LINUX 18
59 picard_createsequencedictionary Creates a sequence dictionary for a reference sequence. This tool creates a sequence dictionary file (with ".dict" extension) from a reference sequence provided in FASTA format, which is required by many processing and analysis tools. The output file contains a header but no SAMRecords, and the header contains only sequence records. LINUX 63
60 picard_fixmateinformation Verify mate-pair information between mates and fix if needed. This tool ensures that all mate-pair information is in sync between each read and its mate pair. If no OUTPUT file is supplied then the output is written to a temporary file and then copied over the INPUT file. Reads marked with the secondary alignment flag are written to the output file unchanged. LINUX 45
61 picard_markduplicates Identifies and tags duplicate reads in a BAM or SAM file, where duplicates are defined as reads originating from the same fragment of DNA. LINUX 64
62 picard_mergesamfiles Merges multiple (fewer than six) SAM and/or BAM files into a single file. This tool is used for combining SAM and/or BAM files from different runs or read groups, similarly to the "merge" function of SAMtools. LINUX 16
63 picard_sortbam Sorts a BAM file. This tool sorts the input BAM file by coordinate, queryname (QNAME), or some other property of the SAM record. LINUX 1
64 picard_sortsam Sorts a SAM or BAM file. This tool sorts the input SAM or BAM file by coordinate, queryname (QNAME), or some other property of the SAM record. The SortOrder of a SAM/BAM file is found in the SAM file header tag @HD in the field labeled SO. LINUX 49
65 qiime_align_seqs Aligns the sequences in a FASTA file to each other or to a template sequence alignment, depending on the method chosen. LINUX 11
66 qiime_alpha_rarefaction Generates rarefied OTU tables, computes alpha diversity metrics for each rarefied OTU table, collates the alpha diversity results, and generates alpha rarefaction plots. LINUX 10
67 qiime_assign_taxonomy Assigns taxonomy to each sequence. LINUX 10
68 qiime_beta_diversity_through_plots Performs beta diversity and principal coordinate analyses and generates a preferences file along with 3D PCoA plots. LINUX 10
69 qiime_filter_alignment Filters a sequence alignment by removing highly variable regions. LINUX 10
70 qiime_make_otu_mapping_table Counts the number of times an OTU is found in each sample, and adds the taxonomic prediction for each OTU in the last column if a taxonomy file is supplied. LINUX 10
71 qiime_make_otu_table Counts the number of times an OTU is found in each sample, and adds the taxonomic prediction for each OTU in the last column if a taxonomy file is supplied. LINUX 10
72 qiime_make_phylogeny Produces a phylogenetic tree from a multiple sequence alignment. LINUX 10
73 qiime_pick_otus The OTU picking step assigns similar sequences to operational taxonomic units (OTUs) by clustering sequences based on a user-defined similarity threshold. LINUX 10
74 qiime_pick_rep_set After picking OTUs, you can then pick a representative set of sequences. LINUX 10
75 qiime_split_libraries Splits libraries according to the barcodes specified in the mapping file. LINUX 10
76 qiime_summarize_taxa_through_plots Summarizes OTUs by category (optional, pass -c), summarizes taxonomy, and plots the taxonomy summary. LINUX 10
77 qiime_validate_mapping_file Checks the user's metadata mapping file for required data and valid format. LINUX 11
78 r_summary_for_flagstat_depthofcoverage Group generic methods can be defined for four pre-specified groups of functions: Math, Ops, Summary, and Complex. (There are no objects of these names in base R, but there are in the methods package.) (Summarizes the results in R into a single table.) LINUX 16
79 r_voom The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation, and enters these into the limma empirical Bayes analysis pipeline (builds a design matrix for the mutation vs. wild-type contrast, then performs quantile normalization via the voom transformation using that design matrix). LINUX 21
80 rsem_bam_calculate_expression Aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments (quantification that counts reads against the reference using the filtered BAM file). LINUX 21
81 rsem_prepare_reference Builds references from a genome. RSEM can extract transcript sequences from the genome based on a given GTF file (with bowtie2). LINUX 21
82 samtools_faidx Indexes a reference sequence in FASTA format, or extracts subsequences from an indexed reference. If no region is specified, faidx indexes the file and creates a .fai file on disk; if regions are specified, the subsequences are retrieved and printed to stdout in FASTA format. LINUX 64
83 samtools_flagstat Does a full pass through the input file to calculate and print statistics to stdout. (Step that scans the whole file and prints its statistics) LINUX 37
84 samtools_index Indexes a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not work with SAM files even if they are bgzip compressed; to index such files, use tabix(1) instead.) LINUX 35
85 samtools_index_bam Indexes a coordinate-sorted BAM or CRAM file for fast random access. (Indexing step that improves performance for random access to a BAM file) LINUX 85
86 samtools_index_sam Indexes a coordinate-sorted BAM or CRAM file for fast random access. (Indexing step that improves performance for random access to a SAM file) LINUX 1
87 samtools_rmdup Removes potential PCR duplicates: if multiple read pairs have identical external coordinates, only the pair with the highest mapping quality is retained. In paired-end mode, this command works ONLY with FR orientation and requires ISIZE to be set correctly. It does not work for unpaired reads. (Step that removes potential PCR duplicates) LINUX 43
88 samtools_sort Sorts alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added, or an existing one updated, if necessary. LINUX 32
89 samtools_view With no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header) LINUX 44
90 sickle_pe Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads, and when quality is sufficiently high to trim the 5'-end of reads (paired-end mode, with a pair file). LINUX 94
91 snpeff SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of variants on genes (such as amino acid changes). (Step for variant annotation and effect prediction; predicts and describes the effect of variants on genes) LINUX 18
92 spark_bwa_mem SparkBWA MEM is a tool that integrates the Burrows-Wheeler Aligner (BWA) into the Apache Spark framework running on top of Hadoop. HADOOP 2
93 tophat2 TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons LINUX 44
94 trimmomatic_pe Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. The paired-end mode maintains the correspondence of read pairs and also uses the additional information contained in paired reads to better find adapter or PCR primer fragments introduced by the library preparation process. (Step that removes adapters from paired-end reads and trims Illumina FASTQ data) LINUX 1
95 ubu_filtering Filters reads from a paired-end SAM or BAM file (only outputs paired reads). (Step that removes reads with large indels or inserts and poorly mapped reads) LINUX 21
96 ubu_sort_bam Parameterizes samtools properly. (Step that sorts the aligned BAM file by chromosome to match the reference order) LINUX 21
97 ubu_translate Translates from genome to transcriptome coordinates. (Step that annotates the chromosome-sorted BAM file with the transcriptome) LINUX 21
98 vina Vina is an open-source program for doing molecular docking. Vina is one of the two generations of distributions of AutoDock. This software uses a sophisticated gradient optimization method in its local optimization procedure. The calculation of the gradient effectively gives the optimization algorithm a “sense of direction” from a single evaluation. By using multithreading, this software can further speed up the execution by taking advantage of multiple CPUs or CPU cores LINUX 2
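The samtools steps listed above (rows 82–89) are typically chained into a short post-alignment clean-up. A minimal sketch, assuming samtools is installed; the file names (ref.fa, sample.bam) are placeholders, not files from any pipeline above:

```shell
# Index the reference FASTA; creates ref.fa.fai on disk (samtools_faidx).
samtools faidx ref.fa

# Sort alignments by leftmost coordinate (samtools_sort).
samtools sort -o sample.sorted.bam sample.bam

# Remove potential PCR duplicates, keeping the highest-MAPQ pair (samtools_rmdup).
samtools rmdup sample.sorted.bam sample.rmdup.bam

# Build a .bai index for fast random access (samtools_index_bam).
samtools index sample.rmdup.bam

# Full pass over the file; prints mapping statistics to stdout (samtools_flagstat).
samtools flagstat sample.rmdup.bam
```

Note that recent samtools releases deprecate rmdup in favor of samtools markdup, so the duplicate-removal step may need adjusting depending on the installed version.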