Measuring Outcomes in Breast Surgery




The purpose of this article is to update and inform plastic surgeons regarding the available tools to evaluate outcomes after breast surgery. An overview of the current literature on clinician-reported outcomes, patient-reported outcomes, and national outcome audits is provided.


Key points








  • Clinician-reported rating scales used to assess cosmetic outcomes after breast surgery usually demonstrate poor to moderate inter-rater reliability, regardless of the type of scale used to assess the outcome.



  • Although the validity of the Baker classification system for capsular contracture severity has been demonstrated, its reliability remains unknown.



  • A scoring system designed to assess the severity of breast skin injury after mastectomy and postmastectomy reconstruction is now available: the Mastectomy SKIN Score.



  • Further research is needed to evaluate the measurement tools used to assess donor site morbidity after breast reconstructive surgery.



  • The BREAST-Q is a well-developed, validated, breast surgery–specific, patient-reported outcome (PRO) measure that can be used to evaluate patients’ perception of outcome after breast reduction, augmentation, or reconstruction.



  • The American College of Surgeons National Surgical Quality Improvement Program (NSQIP) is a national outcomes database that can be used to evaluate surgical outcomes after tissue flaps, breast reductions, and breast reconstruction.






Introduction


In recent years, increased emphasis has been placed on measuring clinical outcomes and improving quality of care in the United States. As a result, a growing number of outcomes-based measures have been developed to assist providers and health care organizations in measuring their performance. This review article provides a summary of the available tools to help surgeons and hospitals achieve measurable improvements in



  1. Patient satisfaction after plastic surgery of the breast (ie, breast reduction, augmentation, and reconstruction)

  2. Overall quality of care for breast surgical patients





Clinician-reported outcomes


Before the recent advances in PRO measurement, clinician-reported outcome (ClinRO) tools constituted the basis of outcomes assessments in plastic surgery. ClinROs are successfully used to report endpoints that cannot be directly reported by a patient. The main drawback of using a ClinRO tool in plastic surgery is that although a given tool may reliably measure a particular outcome through the eyes of the assessor, it typically does not measure how a patient perceives this outcome. In the setting of breast surgery, this is especially important because surgery is directed toward a restoration or improvement in breast form as realized by the patient. Furthermore, existing data suggest that there may be considerable interobserver variability not only among providers but also between providers and patients when measuring subjective parameters. For instance, in the oncologic literature, several reports have demonstrated some discordance between what providers measure and what patients consider important when assessing toxicity symptoms of cancer treatments. In breast surgery, such differences must be recognized by the surgeon preoperatively to set realistic patient expectations and maximize postoperative satisfaction. To that effect, ClinRO tools, when used in combination with validated PRO instruments, still play a role in measuring provider-patient discordance when assessing subjective outcomes after breast surgery.


Cosmetic Outcome After Breast Surgery


The evaluation of aesthetic outcome after breast surgery is by nature a highly subjective process. A literature review identified 2 scales used frequently in recent studies for breast cosmetic assessment by the provider:



  1. Harvard scale

  2. ABNSW scoring system, supported by the Japanese Breast Cancer Society



The Harvard scale is a simple ordinal scale consisting of 4 categories: excellent, good, fair, and poor. Four-point scales evaluating the results of breast reconstruction tend to have unacceptable inter-rater reliability because raters use subjective guidelines to characterize each given category.


The ABNSW scoring system contains 5 subscales:




  • A, Asymmetry

  • B, Breast shape

  • N, Nipple deformation

  • S, Skin condition

  • W, Wound scar



For each category, a score from 0 to 3 is established as follows:




  • 3: Excellent—both breasts have a similar appearance



  • 2: Good—there are a few differences between both breasts but only on close observation



  • 1: Fair—there are marked differences between the breasts from a distance



  • 0: Poor—there are severe, unattractive differences between the breasts



A total score is then calculated based on the 5 items, where the cosmetic outcome is deemed “Excellent” when the total score is 15 points, “Good” when the total score is between 11 and 14, “Fair” when it is between 6 and 10, and “Poor” when it is less than 6. This type of scale tends to have higher inter-rater reliability because raters follow specific guidelines to characterize each given category, and there is less subjective interpretation.
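The scoring rule just described can be written out as a short sketch. The function and variable names are illustrative assumptions and are not part of the ABNSW system itself; only the 0 to 3 subscale range and the total-score cutoffs are taken from the text.

```python
from typing import Dict

# Subscale letters from the text: Asymmetry, Breast shape, Nipple deformation,
# Skin condition, Wound scar. The constant name is an illustrative assumption.
SUBSCALES = ("A", "B", "N", "S", "W")


def abnsw_total(scores: Dict[str, int]) -> int:
    """Sum the five subscale scores (each 0 to 3) into a total from 0 to 15."""
    if set(scores) != set(SUBSCALES):
        raise ValueError("expected exactly the subscales A, B, N, S, W")
    for name, value in scores.items():
        if value not in (0, 1, 2, 3):
            raise ValueError(f"subscale {name} must be scored 0 to 3, got {value}")
    return sum(scores.values())


def abnsw_category(total: int) -> str:
    """Map a total score to the overall category using the cutoffs in the text."""
    if not 0 <= total <= 15:
        raise ValueError("total must be between 0 and 15")
    if total == 15:
        return "Excellent"
    if total >= 11:
        return "Good"
    if total >= 6:
        return "Fair"
    return "Poor"
```

For instance, a breast scored 3 on every subscale except a 2 for wound scar totals 14 and falls into the "Good" category, because "Excellent" requires the maximum of 15.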


Reliability of outcome results


When using such instruments, demonstration of the reliability of the results, with acceptable intra-rater and inter-rater reliability, is essential because cosmetic assessments may vary depending on the evaluator or the timing of the assessment. Inter-rater reliability can be calculated using the kappa statistic, which gives the reader a quantitative measure of the magnitude of agreement between raters. Values of the kappa statistic range from −1 (complete disagreement) to +1 (perfect agreement), with a value of 0 representing exactly what is expected by chance. Kappa statistic values are interpreted as follows:




  • Less than 0.40, poor agreement beyond chance

  • 0.40 to 0.75, fair to good agreement beyond chance

  • Greater than 0.75, excellent agreement beyond chance
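As a minimal sketch of how such a value is obtained, the example below assumes the two-rater (Cohen) form of kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance. The text does not specify which kappa variant the cited studies used, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who each rated the same cases once."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases given the same category by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed
    # over categories (a category missing for one rater contributes zero).
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa: float) -> str:
    """Apply the interpretation bands listed in the text."""
    if kappa < 0.40:
        return "poor agreement beyond chance"
    if kappa <= 0.75:
        return "fair to good agreement beyond chance"
    return "excellent agreement beyond chance"
```

Two raters who agree on only half of the cases when half agreement is already expected by chance obtain a kappa of 0, which is why raw percentage agreement alone can overstate reliability.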



In a recent study by Leonardi and colleagues, the investigators evaluated the impact of medical specialty and provider gender on aesthetic evaluation after autologous breast reconstruction with and without radiation therapy. Raters used the Harvard scale and a numeric scale from 0 (worst result) to 10 (best result) for evaluation of cosmetic outcomes. Overall, there was moderate inter-rater reliability and significant differences among specialties when using a binary classification system of positive/negative judgment. Plastic surgeons’ opinions had the most reliable level of agreement (κ = 0.60 vs κ = 0.45 and κ = 0.48 for radiation oncologists and breast surgeons, respectively). Female breast surgeons consistently gave the lowest scores, followed by female radiation oncologists. Regardless of gender, plastic surgeons gave the most uniform opinion and the most favorable aesthetic scores. The analysis using the Harvard scale provided poor to fair inter-rater reliability. Other studies evaluating cosmetic ClinROs after breast reconstruction usually demonstrate poor to moderate inter-rater reliability, regardless of the type of scale used to assess the outcome.


Capsular Contracture


The Baker classification system is the most commonly used and widely accepted rating scale for grading the severity of implant capsular contracture in the setting of both cosmetic breast augmentation and postmastectomy implant-based reconstruction. This rating scale, a ClinRO instrument, uses clinical examination to guide the assessment of the degree of firmness, with or without distortion, around an implant.


Although the Baker system was originally developed to evaluate capsule formation in the setting of augmentation mammaplasty, modification of the classification system was subsequently applied to better classify capsule formation in the setting of prosthetic breast reconstruction. More specifically, the distinction between the 2 systems was highlighted to address that a significant proportion of reconstructed breasts have what is considered a detectable implant, due not necessarily to capsular contracture but rather to the lack of overlying soft tissue.


For example, in the setting of augmentation mammaplasty:



  1. Class I represents a natural looking breast in which the implant is not detectable.

  2. Class II assumes some degree of capsular contracture in the augmented breast because the implant is detectable.



In the setting of implant-based reconstruction, class I is subdivided into 2 subgroups: class IA represents a natural looking breast in which the implant is not visible; class IB describes a soft but visible implant secondary to the performance of the mastectomy. By contrast, class II in the modified system represents an implant with mild firmness. In both systems, a class IV contracture represents a symptomatic breast that is excessively firm to the touch and as such is believed to require surgical intervention ( Table 1 ).



Table 1

Baker classification systems

Original Baker Classification of Capsular Contracture After Augmentation Mammaplasty

| Class | Description |
|---|---|
| Class I | Breast absolutely natural; no one could tell breast was augmented. |
| Class II | Minimal contracture; surgeon can tell surgery was performed but patient has no complaint. |
| Class III | Moderate contracture; patient feels some firmness. |
| Class IV | Severe contracture; obvious just from observation. |

Classification of Capsular Contracture After Prosthetic Breast Reconstruction

| Class | Description |
|---|---|
| Class IA | Absolutely natural, cannot tell breast was reconstructed. |
| Class IB | Soft, but the implant is detectable by physical examination or inspection because of mastectomy. |
| Class II | Mildly firm reconstructed breast with an implant that may be visible and detectable by physical examination. |
| Class III | Moderately firm reconstructed breast. The implant is readily detectable, but the result may still be acceptable. |
| Class IV | Severe contracture with an unacceptable aesthetic outcome and/or significant patient symptoms requiring surgical intervention. |


Although both classification systems are straightforward and easy to use, neither has undergone the kind of formal evaluation that may be applied to rating scales. For example, a literature review reveals that there are no published studies that formally evaluate the Baker classification system for reliability (ie, the extent to which it gives consistent results) or responsiveness to change. Furthermore, concern exists that because the measurement of capsular contracture severity is based on a provider’s subjective observation, it may be imprecise and vulnerable to provider/observer bias.


There is evidence, however, that supports the validity of the scale, that is, the degree to which it measures what it is supposed to measure. For example, Zahavi and colleagues compared the clinical assessment of capsular contracture with the radiologic thickness of the capsule, as evaluated by both ultrasound (US) and MRI. A total of 20 patients with 27 implants were evaluated in the study. A positive correlation was found between capsular thickness, as evaluated by either US or MRI, and the clinical Baker score, with P values of 0.002 and 0.017, respectively. More specifically, on US, a Baker score of I or II corresponded to a thinner capsule, averaging 1.14 mm, compared with 2.39 mm for a Baker score of III or IV. The same correlation was found with MRI, where a Baker score of I or II correlated with a capsule measuring 1.39 mm thick, compared with a thicker capsule averaging 2.62 mm for a Baker score of III or IV.


A similar study was undertaken to investigate long-term histologic changes in the environment of breast implants and their correlation with capsular contracture as defined by the Baker score. The collagenous capsules of 53 silicone breast implants from 43 patients were evaluated histologically for capsular thickness. Higher Baker scores were significantly associated with increasing capsular thickness (P<.009).


Although these latter methods of capsular contracture evaluation may provide more objective measurements, they are neither cost effective nor appropriate for routine evaluation of capsule development. There have been several attempts to objectively measure capsular contracture with instruments designed to measure deformability, using tools such as compression calipers or tonometry. These methods may allow more precise measurement but may be less valid, because the overlying skin and subcutaneous tissue, with or without the surrounding breast, must also be compressed by the instrument, which adds a confounding variable. Thus, in spite of its limitations, the Baker classification system remains the gold standard approach to capsular contracture severity grading at the present time.


Breast Skin Ischemic Injury


With increasing adoption of immediate breast reconstruction (IBR) after skin-sparing mastectomy (SSM) and nipple-sparing mastectomy (NSM), breast skin ischemic injury is a frequently encountered complication. Mastectomy skin necrosis—an umbrella term commonly (and sometimes erroneously) used in the scientific literature to describe a spectrum of postoperative breast skin ischemic injuries—has been the focus of many recent publications. Breast skin ischemic injury after mastectomy and IBR may be associated with profound consequences secondary to the breakdown of the skin barrier: it may postpone the initiation of adjuvant therapy due to delayed wound healing and potentially lead to reoperation for prosthetic infection and reconstruction failure. The incidence of postoperative breast skin ischemia or necrosis is difficult to estimate, however, due to the lack of a standardized method to characterize the severity of ischemic tissue injury.


Mastectomy SKIN Score


A newly developed, simple scoring system assessing the severity of breast skin injury after mastectomy and IBR is now available: the Mastectomy SKIN Score (unpublished report). The Mastectomy SKIN Score was developed at Mayo Clinic by a group of breast surgeons and plastic surgeons as part of a quality improvement initiative seeking to reduce the morbidity of SSM or NSM with IBR. It is patterned after the established scoring system for burn injuries (ie, a scoring system that takes into account both the depth and the extent of tissue injury). When using the Mastectomy SKIN Score, a score with 2 components is assigned to each operated breast ( Table 2 ):



  1. A letter score on a 4-point scale for depth of skin ischemic injury, plus

  2. A numeric score on a 4-point scale for the surface area of the deepest skin ischemic injury



Table 2

Mastectomy SKIN Score. Each breast receives both a number and a letter score to characterize the severity of breast skin ischemic injury, based on 2 characteristics: (1) the greatest depth of tissue ischemic injury and (2) the surface area involved of the area of greatest depth. The breast mound and nipple-areolar complex are scored separately.

| Depth score | Depth of tissue ischemic injury | Surface area score | Surface area involved |
|---|---|---|---|
| A | No evidence of skin ischemia or necrosis | 1 | None |
| B | Color change of skin suggesting impaired perfusion or ischemia (may be cyanosis or erythema) | 2 | Change involving 1%–10% of breast skin or 1%–10% of NAC |
| C | Partial thickness skin necrosis resulting in at least epidermal sloughing | 3 | Change involving 11%–30% of breast skin or 11%–30% of NAC or total nipple involvement (a) |
| D | Full-thickness skin necrosis (b) | 4 | Change involving > 30% of breast skin or > 30% of NAC |

Abbreviation: NAC, nipple-areolar complex.

(a) Because the nipple itself is considered a key to the aesthetics of the breast, if there is skin necrosis involving the entire nipple, the surface area score of the NAC is automatically upgraded to a score of at least 3, even if the nipple represents less than 10% of the surface area of the NAC.

(b) Note: areas that are not definitely full thickness should be scored as partial thickness.



The categories for depth of ischemic tissue injury include the extremes of no skin injury and full-thickness necrosis. Two intermediate categories for depth of injury are defined: partial thickness necrosis, which must be evidenced by at least epidermal sloughing, and a milder category of skin color change suggestive of ischemia (eg, cyanosis or erythema) in the absence of any findings of actual tissue necrosis. For assessing the size of the area involved, a parallel structure of 4 categories is selected:



  1. None

  2. 1%–10%

  3. 11%–30%

  4. Greater than 30%
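The two-component structure can be illustrated with a brief sketch. Several details here are assumptions rather than part of the published description: the function names, the combined string form (for example, "C2"), and the handling of percentages that fall between the listed bands (any nonzero value up to 10% is treated as category 2, and anything above 10% and up to 30% as category 3). The `total_nipple_involvement` flag applies only when the nipple-areolar complex is being scored.

```python
# Depth letters and their definitions as listed in Table 2.
DEPTH_DEFINITIONS = {
    "A": "No evidence of skin ischemia or necrosis",
    "B": "Color change of skin suggesting impaired perfusion or ischemia",
    "C": "Partial thickness skin necrosis with at least epidermal sloughing",
    "D": "Full-thickness skin necrosis",
}


def area_score(percent_involved: float, total_nipple_involvement: bool = False) -> int:
    """Numeric score (1 to 4) for the surface area of the deepest injury."""
    if not 0 <= percent_involved <= 100:
        raise ValueError("percent_involved must be between 0 and 100")
    if percent_involved == 0:
        score = 1
    elif percent_involved <= 10:  # assumption: fractional values round into this band
        score = 2
    elif percent_involved <= 30:
        score = 3
    else:
        score = 4
    # Table 2 footnote: necrosis of the entire nipple upgrades the NAC area
    # score to at least 3, regardless of the percentage involved.
    if total_nipple_involvement:
        score = max(score, 3)
    return score


def skin_score(depth: str, percent_involved: float,
               total_nipple_involvement: bool = False) -> str:
    """Combine the depth letter and area number into one label (hypothetical format)."""
    if depth not in DEPTH_DEFINITIONS:
        raise ValueError("depth must be one of A, B, C, D")
    return f"{depth}{area_score(percent_involved, total_nipple_involvement)}"
```

Because the breast mound and the nipple-areolar complex are scored separately, a function like `skin_score` would be called once for each region of each operated breast.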



In a retrospective review of consecutive patients who underwent SSM or NSM with IBR, postoperative photograph scores using the Mastectomy SKIN Score, including both the letter score (ie, depth of tissue ischemic injury) and the numeric score (ie, surface area involved) as well as their combinations, strongly correlated with reoperation outcomes. The combined letter and numeric score demonstrated a C statistic of 0.97 for SSM and 0.94 for NSM for predicting the need for additional surgical intervention after breast skin ischemic injury (unpublished data). The score is currently in the process of validation.
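A C statistic measures discrimination: it is the probability that a randomly chosen patient who had the outcome received a higher score than a randomly chosen patient who did not, and it is equivalent to the area under the receiver operating characteristic curve. The sketch below computes it by pairwise comparison. How the combined SKIN score was converted into a single numeric predictor is not described in the text, so the `scores` argument here is simply any hypothetical numeric severity ranking.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Concordance (C) statistic for a numeric score against a binary outcome.

    outcomes uses 1 for patients who had the event (eg, reoperation) and 0 otherwise.
    """
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for event_score in events:
        for non_event_score in non_events:
            if event_score > non_event_score:
                concordant += 1.0  # score ranks the event patient higher
            elif event_score == non_event_score:
                concordant += 0.5  # ties count as half
    return concordant / (len(events) * len(non_events))
```

A value of 0.5 corresponds to discrimination no better than chance and 1.0 to perfect separation of the two groups.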


Donor Site Morbidity


In the setting of postmastectomy reconstruction, the choice of procedure is often based on patient preferences for health outcomes and on the patient’s risk profile. The most common type of autologous tissue breast reconstruction uses the abdominal donor site to reconstruct the breast. Newer surgical techniques designed to decrease abdominal wall morbidity have led to an increasing number of women seeking microvascular transverse rectus abdominis myocutaneous (TRAM) or perforator flap reconstruction rather than traditional pedicled TRAM flap reconstruction. The relative benefit of these individual flaps remains, however, controversial.


Despite the lack of consensus in the literature regarding abdominal donor site morbidity in TRAM or related flap reconstruction patients, few would dispute that the donor site is often a source of patient anxiety and concern. Adequate levels of abdominal muscular strength are necessary to engage in daily activities, such as lifting, performing sports-related activity, and maintaining erect posture. For this reason, outcome data that evaluate potential donor site morbidity after autogenous tissue reconstruction should be incorporated into the informed consent process.


To date, the bulk of published data regarding abdominal donor site morbidity has used objective measures to evaluate isolated abdominal muscle function. The most common measurement tool used for this purpose is a dynamometer. Dynamometers are devices that measure force or power. Several classes of dynamometers are available for measuring isometric, isotonic, or isokinetic muscle strength. The isometric dynamometer requires an individual to push or pull maximally against the recording device without movement taking place. The isotonic dynamometer requires lifting as much weight as possible through a full range of motion. The isokinetic dynamometer controls the speed of movement during maximal contraction while measuring the force applied.


Dynamometers are frequently used to assess neuromuscular function because they can, in theory, provide objective, detailed torque and velocity data, variables that can be used to calculate overall muscle strength. There is evidence, however, to suggest that the reliability of dynamometry depends on the muscle groups tested. For example, Agre and colleagues found better reliability of isokinetic testing with upper limb muscles compared with lower limb muscles. To the best of the authors’ knowledge, no group has evaluated the reliability of dynamometry in the breast reconstructive population. In a population of patients with chronic low back pain, reliability testing revealed highly significant learning effects for isometric trunk flexion and isokinetic abdominal measurements. It has also been suggested that the use of a warm-up before and/or rest period between repeated measurements may influence the fatigability of a muscle and ultimately the maximal strength recorded. It follows that using repeat measures to evaluate truncal function with these measurement tools may be problematic.


Perhaps because of these issues, as well as the expense and complexity involved in using a dynamometer, other groups have simply used the ability to do sit-ups as a surrogate measurement for formal strength testing. More formal grading systems, such as the Kendall and the Lacote grading systems, which similarly use clinical examination and manual muscle testing to evaluate function, have also been used in this setting (Table 3). Hamdi and colleagues, for example, evaluated 20 consecutive patients after 17 unilateral and 3 bilateral deep inferior epigastric perforator flap reconstructions for the purpose of measuring their abdominal wall function preoperatively and at 3 and 6 months postoperatively using a muscle grading system. Their results suggest that all patients had reached or even improved on their preoperative level of upper and lower rectus muscle function 6 months after the operation. The external oblique muscles were the most affected by flap harvesting, but only 2 patients were found to have a measurable impairment after 6 months.


Nov 20, 2017 | Posted by in General Surgery | Comments Off on Measuring Outcomes in Breast Surgery
