by Haibin Shu

Is Your Statistical Programmer A Yes-man?

Project statisticians often enjoy working with “yes-man” type statistical programmers. In a typical clinical trial, the project statistician not only provides essential statistical expertise but also manages the biometric deliverables, such as producing all analysis outputs in a timely manner. The statistical programmer, in turn, works very closely with the project statistician and provides the programming support needed to carry out the statistical analysis methods. It is not surprising for an organization to have the statistical programmer report directly to the project statistician to emphasize and reinforce this relationship.

Advantages of a yes-man statistical programmer, to name a few:  

  • Minimizing back-and-forth discussions: this could shorten the timeline, since both parties can spend more time on their respective assignments
  • Carrying out the statistical analysis by straightforward programming: when the project statistician is the thinker and the statistical programmer is just the doer (or coder), the latter’s work becomes simple and straightforward
  • Avoiding conflict: saying yes in all circumstances eliminates conflicts

For comparison, some disadvantages of a yes-man statistical programmer:  

  • Lacking important inputs:

1) The statistical programmer may offer valuable opinions from unique or different angles, e.g. availability of data sets for certain analyses, convergence issues, and formatting issues. If the statistical programmer needs to raise certain issues before proceeding, the sooner the better. Raising them late risks delaying completion of the project.

2) The statistical programmer should thoroughly study and understand the statistical analysis plan (SAP) document. In most circumstances the statistical programmer is given an opportunity to review the SAP while it is being drafted. Without the statistical programmer’s input, the SAP risks later amendments to important content on how to carry out the analyses effectively through statistical programming.

  • Lacking independent checks: The statistical programmer should independently verify the analysis results produced by the project statistician and vice versa. If either party accepts results as is, the accuracy of the analysis results may be compromised.
  • Lacking ownership: The statistical programmer may become less actively involved because decisions are likely made unilaterally by the project statistician. This lack of ownership could lead to lower productivity in the overall project.


There are always two sides to the story of yes-man statistical programmers: advantages and disadvantages. Efforts should be made to cultivate a more productive relationship that inspires independent peer-to-peer checks without sacrificing the efficiency of close collaboration.

by Haibin Shu

Why Does Hybrid Work Better?

A traditional clinical database platform (CDB) is built upon a relational database system and is generally equipped with powerful functions for query management and tracking. A modern CDB in the age of electronic data capture (EDC) additionally provides a user-friendly interface for site users, i.e. site coordinators, investigators, and site monitors. All these features support ongoing data management (DM) activities and timely access to real-time clinical data for analyses and pre-planned business decisions. None of these features is readily available in a SAS environment.

However, SAS programming plays a unique and important role in supporting DM [1][2]. SAS is a language built for “specialty” programmers. A person without a coding background would have difficulty using it; a programmer without domain knowledge (such as clinical trial knowledge for clinical SAS programmers) would also find it difficult to use. For example, it is hard to write code that cross-checks data consistency between the adverse events (AE) and concomitant medications (CM) panels/data sets without first understanding the clinical relevance of the two data sets.


Cross panel checks – too much work in CDB

Developing complex cross-panel checks in CDB, e.g. checks between lab data (LB) and AE, is generally time-consuming. Most CDBs don’t seem to provide an easy-to-use tool that links panels at the Visit and/or Event levels. CDBs that allow dynamic links between forms mitigate the difficulty by populating certain key information at data entry, e.g. identifying the associated AE number from a pull-down list (retrieved through a pre-built link with the AE page(s)) while entering the CM page. For CDBs without such advanced features, an experienced CDB designer/developer may have to spend a considerable amount of time developing custom code. In either situation, the solution adds cost and is limited because it works only for a particular type of cross-panel check. For example, the dynamic form link approach won’t work for LB and AE checks in studies where LB data are uploaded instead of being entered at the site(s).

On the other hand, SAS is extremely handy for developing edit checks with complex logic. Linking panels at the Visit and/or Event levels, which seems difficult in CDB, is easily implemented with simple data steps that merge the corresponding data sets by key variables such as Visit Number and Event or Record Sequence Number. A further advantage is avoiding the heavy change control workload that would arise had the checks been developed entirely in CDB.
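
The merge-by-key idea can be illustrated outside SAS as well. The following is a minimal Python sketch, not the author's code: the field names (`subject`, `ae_seq`, `treated_with_med`) and the two check rules are assumptions made up for illustration. It links AE and CM records on shared key variables and reports inconsistencies.

```python
# Hypothetical AE and CM records; all field names are illustrative assumptions.
ae_records = [
    {"subject": "001", "ae_seq": 1, "term": "Headache", "treated_with_med": "Y"},
    {"subject": "001", "ae_seq": 2, "term": "Nausea", "treated_with_med": "N"},
    {"subject": "002", "ae_seq": 1, "term": "Rash", "treated_with_med": "Y"},
]
cm_records = [
    {"subject": "001", "cm_seq": 1, "drug": "Ibuprofen", "ae_seq": 1},
    {"subject": "001", "cm_seq": 2, "drug": "Ondansetron", "ae_seq": 2},
]


def cross_check_ae_cm(ae_records, cm_records):
    """Return (key, message) findings from matching AE and CM on (subject, ae_seq)."""
    # Index CM records by the linking keys, similar to merging sorted data sets by key.
    cm_by_key = {}
    for cm in cm_records:
        cm_by_key.setdefault((cm["subject"], cm["ae_seq"]), []).append(cm)

    findings = []
    for ae in ae_records:
        key = (ae["subject"], ae["ae_seq"])
        has_cm = key in cm_by_key
        if ae["treated_with_med"] == "Y" and not has_cm:
            findings.append((key, "AE marked as treated but no linked CM record"))
        elif ae["treated_with_med"] == "N" and has_cm:
            findings.append((key, "CM linked to an AE marked as not treated"))
    return findings


findings = cross_check_ae_cm(ae_records, cm_records)
for key, message in findings:
    print(key, message)
```

Each finding carries the key it was detected on, so it can be turned into a query against the right subject and record.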

Combining CDB and SAS provides an excellent and cost-effective solution

Therefore, combining CDB and SAS provides an excellent and cost-effective solution for edit check requirements that involve complex cross-panel logic. In this two-party environment, CDB is the hosting environment for all DM activities, including most edit checks. SAS plays a supplemental role that specifically addresses complex edit checks. It is important to update information consistently between CDB and SAS [2]. The process should:

  • Avoid duplicates: SAS should only present incremental information to CDB to follow up with query actions, i.e., queries that were fired previously should not be re-fired. This includes queries that have been fired but closed manually
  • Avoid confusion: queries that no longer fire in SAS due to CDB data updates should be closed in CDB promptly
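
The two rules above amount to comparing the queries already logged with the checks that fire in the latest run. The Python sketch below shows one way to express that comparison; the query key layout (subject, check ID, record number) and the status values are hypothetical, not taken from any particular CDB.

```python
def reconcile_queries(staged, current_run):
    """Compare the staged query log with the checks firing in the latest run.

    staged: dict mapping a query key to its status ("open" or "closed")
    current_run: set of query keys firing in the latest run
    """
    # Avoid duplicates: anything already logged, open or manually closed, is not re-fired.
    to_fire = sorted(current_run - set(staged))
    # Avoid confusion: open queries that no longer fire should be closed.
    to_close = sorted(
        key for key, status in staged.items()
        if status == "open" and key not in current_run
    )
    return to_fire, to_close


# Hypothetical keys: (subject, check ID, record number)
staged = {
    ("001", "AE-CM-01", 2): "open",
    ("002", "AE-CM-02", 1): "closed",  # closed manually, still firing
    ("003", "LB-AE-01", 4): "open",
}
current_run = {
    ("002", "AE-CM-02", 1),
    ("003", "LB-AE-01", 4),
    ("004", "LB-AE-01", 1),
}

to_fire, to_close = reconcile_queries(staged, current_run)
print("fire:", to_fire)
print("close:", to_close)
```

Only the query for subject 004 is new, and only the open query for subject 001 has stopped firing, so those are the two actions passed back to CDB.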

The ideal process

An ideal process is to establish a staging environment that stores and reconciles information from both CDB and SAS on an ongoing basis. It can be a file in Excel, SAS, or another format. The methodology presented in [3] can be generalized for this staging environment.


Using the hybrid technique to satisfy the requirements of complex edit checks in CDB is a viable and cost-saving approach. SAS proves to be an effective programming language for supplementing the work of the CDB developer/designer, thanks to its powerful data manipulation capability.


[1] Gupta, S. Standards for Clinical Data Quality and Compliance Checks, available at

[2] N.E.A.T._Abstract, available at

[3] Shu, H. et al, Smart Programming and SAE Reconciliation, PharmaSUG2010, paper DM06

by Haibin Shu

How to Protect Data Integrity?

Just as integrity is an important characteristic of a human being, data integrity lays the foundation for valid analyses and reports. The integrity of data includes the following important components:

  • Accuracy: data truly reflect source records and are free of transcription errors. This mirrors honesty and truthfulness.
  • Consistency: data are free of logical errors and consistent across domains, visits, and devices (e.g., when the same measurement is collected multiple times during the course of a study). Data entry and data extraction are also consistent in data types, formats, and conventions. This requirement corresponds to the personal traits of dependability and accountability.
  • Security: roles, access levels, scopes, and activities are defined. It is important to allow the right role to perform the right tasks and to prevent anyone else from performing them. This corresponds to self-control and self-discipline.
  • Traceability: who did what, at what time, and/or why. Just as in real life, keeping a track record is essential for data integrity.
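
To make the traceability component concrete, here is a minimal Python sketch of an audit trail that records who changed which field, when, and why. The class and field names are hypothetical and do not represent any particular system or a Part 11 compliant implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable audit trail entry: who did what, when, and why."""
    user: str
    field: str
    old_value: str
    new_value: str
    reason: str
    timestamp: datetime


class AuditedRecord:
    """A data record whose every update is appended to an audit trail."""

    def __init__(self, values):
        self._values = dict(values)
        self.trail = []

    def update(self, user, field, new_value, reason):
        old_value = self._values.get(field, "")
        self._values[field] = new_value
        self.trail.append(
            AuditEntry(user, field, old_value, new_value, reason,
                       datetime.now(timezone.utc))
        )

    def get(self, field):
        return self._values[field]


record = AuditedRecord({"weight_kg": "70"})
record.update("coordinator01", "weight_kg", "72",
              "Transcription error corrected against source")
for entry in record.trail:
    print(entry.user, entry.field, entry.old_value, "->", entry.new_value, entry.reason)
```

The entries are frozen so that a logged change cannot be edited afterwards; corrections are made by appending a new entry.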

How to protect data integrity?

Protecting data integrity requires commitment to four Ps.

  • Platform: a secure system that is Part 11 compliant and provides a good front end for data entry, discrepancy management, and other end-user interfaces. It also provides a strong back end that allows programmers to design and implement a database with the required edit checks, metrics reports, and other important or study-specific functions efficiently and effectively.
  • Process: a rigorous process specifies roles, functions, workflows, teamwork, and responsibilities. It should include change control management.
  • People: committed people are the key to protecting data integrity. Commitment shows in paying attention to details, being sensitive to any potential deviations that might compromise data integrity, and taking proactive steps before mistakes take place. Timely and ongoing training helps more people become committed to data integrity.
  • Passion: protecting data integrity requires corporate-level consensus and collaborative efforts.

by Haibin Shu

Left-hand Programming vs. Right-hand Programming?

Double programming is a gold standard in statistical programming and analysis teams. Its purpose is to ensure that the data are processed correctly and the analysis is conducted correctly, following pre-specified requirements such as the SAP document. The beauty of the practice is that the same results are achieved by two or more different approaches. Often the approaches are independent of each other and might even begin with divergent understandings of particular analysis methods. Ultimately, accuracy is achieved when differences are reconciled and critical understandings converge, just like the teamwork between left hand and right hand!

Generally speaking, the initial programmer has certain advantages, such as choosing naming conventions, setting up output layout formats, and applying statistical procedures. The QC programmer generally focuses on checking the accuracy of the output content and analysis results, and follows the conventions set up by the initial programmer.
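
As a rough illustration of the reconciliation step in double programming, the Python sketch below compares the statistics produced by the initial programmer with those from the QC programmer and lists every difference. The layout of the results (treatment arm mapped to named statistics) and the tolerance are assumptions for this example; in SAS this kind of comparison is typically done with PROC COMPARE.

```python
def compare_outputs(primary, qc, tolerance=1e-8):
    """List differences between two independently produced result sets.

    primary, qc: dict mapping a row key to a dict of statistic name -> numeric value
    """
    differences = []
    for key in sorted(set(primary) | set(qc)):
        if key not in primary or key not in qc:
            differences.append((key, None, "row missing in one output"))
            continue
        for stat in sorted(set(primary[key]) | set(qc[key])):
            a = primary[key].get(stat)
            b = qc[key].get(stat)
            # A statistic present on one side only, or differing beyond tolerance.
            if a is None or b is None or abs(a - b) > tolerance:
                differences.append((key, stat, f"primary={a}, qc={b}"))
    return differences


# Hypothetical summary statistics from the two programmers
primary = {"Placebo": {"n": 50, "mean": 12.4}, "Active": {"n": 48, "mean": 15.1}}
qc = {"Placebo": {"n": 50, "mean": 12.4}, "Active": {"n": 49, "mean": 15.1}}

differences = compare_outputs(primary, qc)
for key, stat, detail in differences:
    print(key, stat, detail)
```

Comparing values rather than formatted text keeps the reconciliation focused on content, not on cosmetic differences between the two outputs.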

Some factors to consider when facilitating a strong collaboration between left-hand programming and right-hand programming are:

  • A win-win culture: approaches are independent but the goal is the same. Commonly, the final goal is to prevent any mistakes in programming and analysis. The initial programmer should always self-check first for initial quality assurance before handing over to the QC programmer for independent review.
  • Avoid cosmetic overdo: both parties should stick to simple, common, and effective conventions. Overly detailed formatting can cost extra time and make reconciliation of non-essential content more difficult, e.g., concatenating variables with special characters and calculated spaces.
  • Keep in constant contact: both parties should talk to each other constantly to stay productive. Changes are usually inevitable. For example, analysis methods may change multiple times in the course of a study before the SAP is finalized. If both parties are informed of change requests at the same time, they can revise their respective programs in parallel rather than sequentially.

by Haibin Shu

Does One Clinical System Fit All Studies?

The answer is ‘No’, because it is very hard to build or find a system that fits the needs of all clinical trials. Selecting a system that is both sufficient and effective therefore becomes very important before launching upcoming clinical trials.

In theory it might be possible to include all factors when creating such a system; in reality, however, the effort might be too great to justify, let alone the ever-evolving clinical trial requirements. For example, the best paper-based system in history would fail to address the basic needs of a simple study today that requires electronic data capture.

Some factors to consider when selecting a clinical system:

  • Experience of site users: less experienced users may require more robust and less error-prone systems for data entry and other necessary tasks.
  • Study design, including visit structures, key data points, etc.: the more complex a study is, the more programming it may need, such as edit checks and metric reports. It may therefore require a system that provides strong programming capabilities in the back end.
  • SAS extracts: many systems are equipped with on-demand data extracts, which make it very convenient to extract real-time data for reporting and analyses. For systems without this capability, getting SAS data sets out of the system can become a daunting task, and nothing can be done until the SAS data sets are generated.
  • Balance of front end and back end: a system can easily be voted down if its front-end functions are not desirable; on the other hand, attention should also be paid to the back-end functions. Selecting a system with friendly front pages but insufficient back-end functions is like driving a nice-looking car without a good engine!

by Haibin Shu

How to Set Up Dropbox as an Effective Programming Environment?

Dropbox provides a cost-effective, secure, and shared environment for SAS programmers. It is quick to set up and easy to operate. Furthermore, it provides a server-like platform for programmers to develop, share, and execute code.

A Cost-Effective Environment

Dropbox is a cost-effective web-based application that provides file storage and file management functions integrated with local systems. It avoids heavy IT spending by reducing hardware and software procurement and maintenance needs.

Quick Set-up and Easy Operation

  1. Download and install the Dropbox desktop application
  2. Share the Dropbox network path, e.g.
  3. Map the above network path to a common drive, e.g. Y:


Synchronization can be done selectively.

As long as the same drive letter is used to map Dropbox and the same study folder structure is used, both programs and data sets can be invoked from any computer with SAS.
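
The point about a common drive letter and folder structure can be sketched in a few lines of Python. The study name and folder names below are hypothetical; the only convention taken from the text is the shared drive letter Y:. Because every path is built from the same root, the same reference resolves on any computer that follows the mapping.

```python
from pathlib import PureWindowsPath

DROPBOX_DRIVE = "Y:"  # the drive letter every programmer maps Dropbox to


def study_path(study, *parts):
    """Build a path under the shared drive using the agreed study folder structure."""
    return PureWindowsPath(DROPBOX_DRIVE + "\\", study, *parts)


# Hypothetical study and folder names, for illustration only
program = study_path("STUDY001", "programs", "t_demog.sas")
raw_data_dir = study_path("STUDY001", "data", "raw")

print(program)
print(raw_data_dir)
```

In a SAS program the same idea would usually appear as a single root definition that all library and include references are derived from.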

by Melissa

How To Ensure Your Topline Results Are Correct?

Ensuring that your topline results are correct has become more challenging when the analysis is prepared in a CDISC environment. CDISC itself is a good thing, since it provides a standard and uniform framework for sharing and reviewing clinical data. However, creating CDISC data sets often involves many data manipulations, which inevitably leads to a natural but critical question: how can we make sure that the intermediate steps that change or reformat data won’t unintentionally introduce errors or bias into the analysis results?

Approach 1. Double Programming to Make Sure CDISC Conversions Are Accurate

This is the common, conventional way to alleviate the concern and ensure the quality of analysis results. Of course, it means extra time and resources. In particular, double programming the CDISC data sets and then comparing and reconciling all differences can take a lot of time and requires a good process and team. More importantly, the analysis results themselves still have to be verified after the CDISC double programming process is complete.

Approach 2. Raw Data Approach

This approach verifies the analysis results directly from the raw data sets. First, it can be performed in parallel with the CDISC approach since it doesn’t depend on CDISC conventions, which could save considerable turnaround time. For the same reason, it is an entirely independent process. Moreover, it verifies not only the analysis results but also the CDISC conversions.
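
A minimal Python sketch of the raw data approach follows. It recomputes one topline statistic, mean change from baseline by treatment arm, straight from raw visit records and compares it with the figure reported from the CDISC-based analysis. The data, variable names, endpoint, and reported values are all invented for illustration.

```python
from statistics import mean

# Hypothetical raw vital signs data: one row per subject per visit
raw_vitals = [
    {"subject": "001", "arm": "Active", "visit": "Baseline", "sbp": 150},
    {"subject": "001", "arm": "Active", "visit": "Week 12", "sbp": 138},
    {"subject": "002", "arm": "Active", "visit": "Baseline", "sbp": 160},
    {"subject": "002", "arm": "Active", "visit": "Week 12", "sbp": 150},
    {"subject": "003", "arm": "Placebo", "visit": "Baseline", "sbp": 148},
    {"subject": "003", "arm": "Placebo", "visit": "Week 12", "sbp": 146},
    {"subject": "004", "arm": "Placebo", "visit": "Baseline", "sbp": 152},
    {"subject": "004", "arm": "Placebo", "visit": "Week 12", "sbp": 148},
]


def mean_change_from_raw(rows, endpoint_visit="Week 12"):
    """Compute mean change from baseline per arm directly from raw rows."""
    baseline, endpoint, arm_of = {}, {}, {}
    for row in rows:
        arm_of[row["subject"]] = row["arm"]
        if row["visit"] == "Baseline":
            baseline[row["subject"]] = row["sbp"]
        elif row["visit"] == endpoint_visit:
            endpoint[row["subject"]] = row["sbp"]
    changes = {}
    # Only subjects with both a baseline and an endpoint value contribute.
    for subject in baseline.keys() & endpoint.keys():
        changes.setdefault(arm_of[subject], []).append(
            endpoint[subject] - baseline[subject]
        )
    return {arm: mean(values) for arm, values in changes.items()}


# Hypothetical topline values reported from the CDISC-based analysis
reported = {"Active": -11.0, "Placebo": -3.0}

raw_result = mean_change_from_raw(raw_vitals)
mismatches = sorted(
    arm for arm in reported if abs(raw_result[arm] - reported[arm]) > 1e-8
)
print(raw_result, mismatches)
```

An empty mismatch list means the raw-data calculation agrees with the reported topline figures; any arm listed there points to either the analysis program or a CDISC conversion step for follow-up.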

Topline results usually require a fast turnaround for good business reasons. The raw data-based approach is worthy of consideration due to its efficiency and independence.