Overview

Scope and Features

As shown in the following figure, we first use MECAT to align all the raw reads, which tells us the most likely genomic location of each read. Usually only ~10% of the reads map around gap regions (within about 2 kbp), and PBJelly wastes a lot of time aligning the remaining reads with Blasr, because reads far from the gap regions are not used for local assembly or gap filling. We can therefore speed up PBJelly by removing those reads first. In addition, we found that the original PBJelly collects reads for local assembly indiscriminately, which usually gives poor results. With our approach, the contig N50 value is sometimes better.
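The pre-filtering idea above can be sketched in a few lines of Python. This is only an illustration, not the code shipped with run_pbjelly: the function name `reads_near_gaps`, the tuple layouts, and the 0-based half-open coordinate convention are assumptions made for the example, and the 2000 bp flank is taken from the "around 2 kbp" figure in the text.

```python
from collections import defaultdict

FLANK = 2000  # assumed flank in bp, from the "around 2 kbp" figure in the text


def reads_near_gaps(gaps, alignments, flank=FLANK):
    """Return the IDs of reads whose alignment overlaps a flank-extended gap.

    gaps:       iterable of (scaffold, gap_start, gap_end), 0-based half-open
    alignments: iterable of (read_id, scaffold, aln_start, aln_end), same convention
    """
    # Extend every gap by `flank` on both sides and group the windows per scaffold.
    windows = defaultdict(list)
    for scaffold, start, end in gaps:
        windows[scaffold].append((max(0, start - flank), end + flank))

    keep = set()
    for read_id, scaffold, aln_start, aln_end in alignments:
        for win_start, win_end in windows.get(scaffold, ()):
            # Standard half-open interval overlap test.
            if aln_start < win_end and win_start < aln_end:
                keep.add(read_id)
                break
    return keep
```

Only the reads returned by such a function would then be passed on to the Blasr mapping step; all other reads are dropped before PBJelly runs.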

run_pbjelly is intended to be easy to use, but some familiarity with command-line applications is expected. Rather than providing a flexible solution to a number of common workflows, we have designed run_pbjelly to be as fixed as possible, so that you can get results without worrying about installing other software or setting up paths for those programs. This design saves you a lot of time, but we suggest reading the published PBJelly paper and related documents in case you need to run specific tests when the results do not meet the contract indicators.

run_pbjelly is composed of a set of standalone tools to perform specific tasks. A brief description of each tool is shown in the table below.

Tool	Subcommand	Description
Python/Perl scripts		A set of short programs for data format conversion and processing
	filt4seq_v2.0.py	Filter out reads longer than 100 kbp or shorter than 3 kbp
	find_gap_v1.0.py	Output gap position information (*.bed) for the scaffolds
	m4_to_bed_v3.pl	Convert MECAT alignment output (.m4) to .bed format
	pick_reads_in_raw.pl	Pick raw reads according to an ID list
PBJelly		Aligns long reads to high-confidence draft assemblies and fills captured gaps
	setup	Tag sequence names, find gaps, and index the reference
	mapping	Use blasr to map the sequences to the reference
	support	Identify which reads support which gaps
	extraction	For each gap, consolidate all reads supporting it into a local-assembly folder
	assembly	Build the consensus gap-filling sequence
	output	Stitch the reference sequences and gap-filling sequences together
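To make the first two helper scripts in the table more concrete, here is a minimal sketch of what gap finding and length filtering amount to. It is not the source of find_gap_v1.0.py or filt4seq_v2.0.py: the function names are hypothetical, gaps are assumed to be encoded as runs of N in the scaffold sequence, and the length thresholds are assumed to be inclusive.

```python
import re

MIN_LEN = 3_000    # 3 kbp lower bound, from the table above
MAX_LEN = 100_000  # 100 kbp upper bound, from the table above


def find_gaps(name, sequence):
    """Return BED-style (name, start, end) tuples for every run of N in a scaffold.

    Coordinates are 0-based and half-open, as in the BED format.
    """
    return [(name, m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def passes_length_filter(read_length, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep reads whose length lies within [min_len, max_len] (assumed inclusive)."""
    return min_len <= read_length <= max_len
```

For example, `find_gaps("scaf1", "ACGTNNNNACGT")` reports a single gap spanning positions 4 to 8.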

Input and Output

For run_pbjelly, you have to prepare two files. The first is the reference file, whose name ends with fasta. The second is the ‘reads.list’ file, in which each line stores the path to one reads file. You then have to prepare a ‘config.cfg’ file containing the reference file path, the reads.list path, the output directory, and the other parameters.

For more information, refer to the part ‘Config and Usage’ under the Examples section.
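As an illustration of the ‘reads.list’ layout described above (one reads file path per line), the following sketch writes such a file from a list of paths. The helper name `write_reads_list` is hypothetical, and resolving every entry to an absolute path is an assumption made here for safety, not a documented requirement.

```python
from pathlib import Path


def write_reads_list(read_files, list_path="reads.list"):
    """Write one reads file path per line and return the lines written.

    Paths are resolved to absolute paths (an assumption of this sketch).
    """
    lines = [str(Path(p).resolve()) for p in read_files]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines
```

The resulting file can then be referenced from config.cfg through the reads_list parameter.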

Parameters

There are five main steps in run_pbjelly, and all the important parameters are stored in the config.cfg file (see the Examples section for more details). The parameters are as follows:

Category	Sub-category	Description
Data		Names and directories of the input and output files
	sample	the project name
	reads_list	absolute path of the reads.list file, which contains the paths of the reads files
	output_dir	directory where all the results are written
	ref	the scaffolds file whose gaps you want to close
Program		Parameters for the individual programs
	mask_len	the length to which a 1 bp gap is extended
	best	the number of best alignments for MECAT to output
	reads_len_max	the max length for raw reads
	reads_len_min	the min length for raw reads
	mecat_n	number of candidates for gap extension
	mecat_b	output the best b alignments
	mecat_mem	memory allocated for MECAT alignment
	mecat_nproc	number of CPUs for MECAT alignment
	blasr_mem	memory allocated for Blasr alignment
	blasr_nproc	Align using N processes
	assembly_nproc	number of CPUs for PBJelly assembly
	qsub_q	specify the queue to use
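Before launching a run it can help to check that no parameter from the table is missing. The sketch below does this with Python's standard configparser. The key names come from the table above, but the INI layout with a [Data] and a [Program] section is an assumption made for illustration; the authoritative config.cfg format is the one shown in the Examples section. The function name `missing_parameters` is hypothetical.

```python
import configparser

# Key names are taken from the parameter table.
# The two-section INI layout is an ASSUMPTION of this sketch.
EXPECTED = {
    "Data": ["sample", "reads_list", "output_dir", "ref"],
    "Program": [
        "mask_len", "best", "reads_len_max", "reads_len_min",
        "mecat_n", "mecat_b", "mecat_mem", "mecat_nproc",
        "blasr_mem", "blasr_nproc", "assembly_nproc", "qsub_q",
    ],
}


def missing_parameters(config_text):
    """Return 'Section.key' names that are expected but absent from config_text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for section, keys in EXPECTED.items():
        for key in keys:
            # has_option returns False when the section itself is absent.
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing
```

An empty return value means every parameter in the table was found; it does not validate the values themselves.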