Development and Psychometric Evaluation of the FACE-Q Satisfaction with Appearance Scale




Satisfaction with appearance and improved quality of life are key outcomes for patients undergoing facial aesthetic procedures. The FACE-Q is a new patient-reported outcome (PRO) instrument encompassing a suite of independently functioning scales designed to measure a range of important outcomes for facial aesthetics patients. FACE-Q scales were developed with strict adherence to international guidelines for PRO instrument development. This article describes the development and psychometric evaluation of the core FACE-Q scale, the Satisfaction with Facial Appearance scale. Both modern and traditional psychometric methods were used to confirm that this new 10-item scale is a reliable, valid, and responsive measure.


Key points








  • Accurate and reliable measurement of patient-centered outcomes is critical to ongoing practice improvement and clinical research in facial aesthetics.



  • Modern psychometric methods overcome the limitations of traditional psychometric methods by providing clinically meaningful interval-level data.



  • The FACE-Q Satisfaction with Facial Appearance scale is a new-generation condition-specific patient-reported outcome instrument, capable of providing clinically meaningful and scientifically sound data reflecting patient perceptions of outcome.






Background


Facial aesthetics procedures are an important area of continued growth in plastic surgery; 13.8 million cosmetic procedures were performed in the United States in 2011, an increase of 5% from 2010. Rhinoplasty (n = 244,000) and blepharoplasty (n = 196,000) were second and third to breast augmentation (n = 307,000) in popularity. Botulinum toxin type A (n = 5.7 million), soft tissue fillers (n = 1.9 million) and chemical peels (n = 1.1 million) were the top three cosmetic minimally invasive procedures.


Specially designed questionnaires known as patient-reported outcome (PRO) instruments, developed to measure a range of outcomes (eg, symptoms, satisfaction, body image, and quality of life), have become a mainstay of clinical research in all areas of medicine and surgery. To provide meaningful measurement, such PRO instruments must be shown to be reliable, valid, and responsive (Table 1). Although understanding the patient’s perspective is especially important in facial aesthetics, a systematic review performed by our team found a lack of reliable and valid PRO instruments for measuring the range of issues important to facial aesthetic patients. We therefore set out to develop a new PRO instrument following the methodology we previously used to develop other plastic surgery–specific PRO instruments. This new PRO instrument is called the FACE-Q and includes a range of separate scales that measure important outcomes for patients having any type of facial cosmetic surgery, minimally invasive cosmetic procedure, or facial injectable.



Table 1

Glossary of terms














































Term: Definition

Ad hoc questionnaire: A PRO instrument that has not been developed and/or validated using acknowledged guidelines. Such PRO instruments may pose clinically reasonable questions, but one cannot be confident about their reliability (ie, ability to produce consistent and reproducible scores) or validity (ie, ability to measure what is intended to be measured).

Conceptual framework: The expected relationships of items within a domain and between domains within a PRO concept. The validation process confirms the conceptual framework.

Domain: A domain is a collective word for a group of related concepts. All the items in a single domain contribute to the measurement of the domain concept.

Generic questionnaires: PRO instruments that can be used in any patient group regardless of their health condition, and allow direct comparisons across disease groups and/or healthy populations. An example of a generic questionnaire is the Short Form 36-Item Health Survey, which is the most widely used generic measure in the world.

Health-related quality of life: In quality-of-life measurement, the terms quality of life, health status, health-related quality of life, and functional status are often used interchangeably. Although there is a lack of conceptual clarity regarding these terms, there is broad agreement on the core minimum set of health concepts that should be measured. These concepts include physical health, mental health, social functioning, role functioning, and general health perceptions.

Item: An individual question, statement, or task that is evaluated by the patient to address a particular concept.

PRO instrument: A questionnaire used in a clinical or research setting in which responses are collected directly from patients. These questionnaires quantify aspects of health-related quality of life and/or significant outcome variables (eg, patient satisfaction, symptoms) from the patient’s perspective. PRO instruments provide a means of quantifying the way patients perceive their health and the impact treatments have on their quality of life.

Reliability: An important property of a PRO instrument because it is essential to establish that any changes observed in patient groups are attributable to the intervention or disease and not to problems in the measure. Test-retest reliability may be evaluated by having individuals complete a questionnaire on more than 1 occasion over a time period when no changes in outcome are expected to have occurred. Commonly reported reliability statistics include the Cronbach alpha and intraclass correlation coefficients.

Responsiveness: The ability of an instrument to accurately detect change. Responsiveness is an important psychometric property when evaluating change as the result of a health care intervention or when following patients over time. Responsiveness is usually examined by comparing preintervention and postintervention scores using standardized change indicators, such as effect size statistics.

Scale: The system of numbers or verbal anchors by which a value or score is derived. Examples include visual analog scales, Likert scales, and rating scales.

Scientific soundness: Refers to the demonstration of reliable, valid, and responsive measurement of the outcome of interest.

Score: A number derived from a patient’s response to items in a questionnaire. A score is computed based on a prespecified, validated scoring algorithm and is subsequently used in statistical analyses of clinical study results. Scores can be computed for individual items, domains, or concepts, or as a summary of items, domains, or concepts.

Validity: The ability of an instrument to measure what is intended to be measured. Establishment of validity may be considered an ongoing process. A PRO instrument is examined from various angles, including an assessment of the development process, consideration of known group differences, evaluation of internal consistency, and evaluation of both convergent and discriminant validity relative to other existing related measures.

Adapted from Food and Drug Administration. Patient reported outcome measures: use in medical product development to support labeling claims. 2009;11:31–3. Available at: www.fda.gov/cber/gdlns/prolbl.pdf; and Cano S, Klassen A, Pusic A. The science behind quality-of-life measurement: a primer for plastic surgeons. Plast Reconstr Surg 2009;123:99–102e; with permission.


This article describes the development and psychometric evaluation of the core FACE-Q scale, called the Satisfaction with Facial Appearance scale.








Qualitative and quantitative methods


We obtained local institutional ethics review board approval before commencing our study. The content for the Satisfaction with Facial Appearance scale was developed as part of a larger suite of scales that cover a range of concepts important to facial aesthetics patients. These scales were constructed with strict adherence to recommended guidelines for PRO instrument development. The guidelines outline three phases required to develop a scientifically credible and clinically meaningful tool.


In the first phase, a conceptual framework is formally defined, and a pool of items is generated. These items are developed from the following three sources: review of the literature, qualitative patient interviews, and expert opinion. The item pool is developed into a series of scales that are pilot tested in the target participant sample to clarify ambiguities in item wording, confirm appropriateness, and determine acceptability and completion time. This phase of our research is described in a separate publication and is summarized later in this paper. In the second phase (the main focus of this article), the scales undergo psychometric evaluation in a large sample of target subjects. Questions representing the best indicators of outcome are retained based on their performance against a standardized set of psychometric criteria. In the third phase, further psychometric evaluation is performed by administering the item-reduced scales to a large sample of participants to further examine their scientific soundness.


Phase 1: Qualitative Phase


Qualitative interviews were conducted with 50 patients recruited from 7 offices of plastic surgeons and dermatologists practicing in New York (United States) and Vancouver (Canada) between January 2008 and February 2009. Participants ranged in age from 20 to 79 years (mean age 51 years) and had undergone 1 or more of the following facial procedures: botulinum toxin (n = 20), resurfacing (n = 15), filler (n = 15), blepharoplasty (n = 25), facelift (n = 22), rhinoplasty (n = 9), neck lift (n = 8), brow lift (n = 4), and chin implant (n = 2).


Patients were interviewed using open-ended questions. Interviews were digitally recorded, transcribed verbatim, and coded within NVivo8 software using a line-by-line coding approach. Data collection and analysis took place concurrently so that ongoing interviews could be used to refine emerging codes and categories. Data analyses led to the development of a conceptual framework that depicts important concepts for facial aesthetic patients (Fig. 1).




Fig. 1


FACE-Q conceptual framework.


To develop scales with items covering the concepts in Fig. 1, we examined codes (ie, key phrases expressed by patients) and linked these to specific patient characteristics (eg, type of procedure, age, and gender). Attaching key patient characteristics to each code provided the information needed to develop core items (common to all patients) and unique items (specific to a subgroup). To develop a set of scales, we then iteratively and interactively examined the item lists developed from the coded material to identify a set of items that mapped out a continuum for each major concept. For each item, we examined Flesch-Kincaid grade level scores and adjusted the wording as necessary to ensure the lowest possible reading grade level. Scale instructions and appropriate response options were then developed for each scale.
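The readability check mentioned above can be illustrated with a minimal sketch of the standard Flesch-Kincaid grade level formula. The function names and the vowel-group syllable counter are assumptions made for illustration; the source does not say which tool was used to compute the scores, and dedicated readability software counts syllables more accurately.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Generic example sentence (not a FACE-Q item): 3 words, 1 sentence, 3 syllables.
grade = flesch_kincaid_grade("The cat sat.")
```

Shorter sentences and fewer syllables per word lower the grade, which is the direction in which item wording would be adjusted.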


The scales were then presented to 26 experts (15 plastic surgeons, 4 dermatologists, 3 psychologists, 4 office staff) to further appraise and refine. In addition, 35 facial aesthetic patients participated in one-on-one cognitive debriefing interviews to identify any ambiguous wording and confirm appropriateness, acceptability, and completion time of the preliminary scales. The process resulted in the development of a set of independently functioning scales that measure the concepts forming the conceptual framework (Table 2).



Table 2

FACE-Q scales

























































































Appearance appraisal scales

  • Facial appearance overall (a)
  • Skin
  • Lines overall
  • Forehead lines
  • Forehead and eyebrows
  • Lines between eyebrows
  • Eyes (overall, double eyelid, upper and lower eyelids)
  • Crow’s feet
  • Eyelashes
  • Cheekbones
  • Cheeks
  • Ears
  • Nasal bridge
  • Nose
  • Nasolabial folds
  • Lips
  • Lip lines
  • Marionette lines
  • Chin
  • Lower face/jawline
  • Under chin
  • Neck

Quality of life scales

  • Psychological well-being
  • Social well-being
  • Age appraisal
  • Expectations and motivations
  • Psychological distress
  • Recovery early life impact

Adverse effect checklists for treatment

  • Recovery early symptoms
  • Skin
  • Forehead, scalp and eyebrows
  • Eyes
  • Nose
  • Lower face and neck
  • Lips
  • Ears

Process of care scales

  • Decision
  • Doctor
  • Information
  • Office staff
  • Office appearance

(a) See Table 4 for the scale’s content.


Relevant scales for all patients.



Phase 2: Quantitative Phase


Data were collected and analyzed to identify the items representing the best indicators for each scale based on their performance against a standardized set of psychometric criteria. Data came from 2 separate studies, and were compiled for the purpose of psychometric analyses. Results presented in this article relate only to the Satisfaction with Facial Appearance scale. This scale was developed for use in research and clinical practice to compare outcomes across procedure types and/or to measure change before and after any facial aesthetic procedure. Future publications will present psychometric findings for the other FACE-Q scales.


Study 1


Data were collected from patients of 10 plastic surgeons and 2 dermatologists representing 10 different practices in the United States (New York, Washington, St Louis, Dallas, and Atlanta) and Canada (Vancouver) between June 2010 and June 2012. Eligible participants included those who could read English; were 18 years of age or older; and were planning to undergo, or had already undergone, any surgical or nonsurgical facial aesthetic procedure.


Given the large number of FACE-Q scales that were developed in the initial phases of research, we grouped scales into booklets based on common surgical and nonsurgical procedures and distributed these to the participating practices. All booklets included the Satisfaction with Facial Appearance scale. Instructions for this scale asked patients to answer a series of items based on “how you look right now” and to complete each item with their “entire face in mind.” The 4 response options were as follows: very dissatisfied, somewhat dissatisfied, somewhat satisfied, and very satisfied. Patient responses to items in each scale are converted to a summary score which ranges from 0 to 100. A higher score indicates higher satisfaction or better quality of life.


Patients from 6 surgical practices were recruited at the time of their appointment and asked to complete a questionnaire booklet in the waiting room before their appointment. Patients from 4 practices were invited to participate in a postal survey. To ensure a high response rate, a personalized letter from the relevant health care provider was included with the appropriate FACE-Q booklet and up to 3 mailed reminders were sent as necessary. All patients invited into the study were given a gift card ($5) to thank them for their participation.


Study 2


A medical device company was provided with the Satisfaction with Facial Appearance scale, alongside other FACE-Q scales relevant to the concerns of patients having facelifts, for a clinical trial involving 100 patients from France, Germany, the United Kingdom, and Israel. Patients completed FACE-Q scales before and after surgery. MAPI (MArchés et Prospectives Internationaux [International Prospects and Markets in English]) Research Trust provided translations and linguistic validation of the FACE-Q scales. This process ensured that the concepts measured by the FACE-Q scales are equivalent across languages (ie, English, German, French, and Hebrew) and easily understood by people in the target country. In brief, MAPI uses a process based on translation principles as detailed by the European Regulatory Issues and Quality of Life Assessment (ERIQA) group and the International Society for Pharmacoeconomics and Outcomes Research and recommended by the US Food and Drug Administration.


Rasch measurement theory and analysis


We analyzed the Satisfaction with Facial Appearance scale data using Rasch measurement theory methods. These methods are increasingly used in health outcome research. Unlike traditional methods, Rasch analysis indicates the extent to which rigorous measurement is achieved by examining the difference (or fit) between the observed scores (patients’ responses to items) and the expected values predicted from the data by a single mathematical model called the Rasch model. The criteria for measurement in Rasch analysis are evaluated interactively using the Rasch model. Thus, a range of evidence is used to evaluate each questionnaire item in a scale. This evidence is then used to make a judgment about the overall quality of the scale.


Rasch analyses were performed on the Satisfaction with Facial Appearance scale using RUMM2030 software. Results were interpreted using published criteria wherever possible as follows:


Item fit validity


The items of the Satisfaction with Facial Appearance scale must work together (fit) as a conformable set both clinically and statistically. When items do not work together (misfit) in this way, it is inappropriate to sum item responses to reach a total score, and the validity of a scale is questioned. Three main indicators were examined to assess item fit:

  • 1. Log residuals (item-person interaction)

  • 2. Chi-square values (item-trait interaction)

  • 3. Item characteristic curves



There are no absolute criteria for interpreting fit statistics. It is more meaningful to interpret them together and in the context of their clinical usefulness as an item set. However, as a guide, fit residual should be between −2.5 and +2.5 with associated nonsignificant chi-square values (significance interpreted after Bonferroni adjustment).
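As a rough illustration of how the guide values above might be applied, the sketch below flags items whose fit residual falls outside ±2.5 or whose chi-square p-value is significant after a Bonferroni adjustment. The function name, the input layout, the base significance level of 0.05, and the example values are all assumptions; in the study these statistics were produced by RUMM2030 and interpreted together with clinical judgment, not by a mechanical rule.

```python
def flag_misfit(items, alpha=0.05, residual_limit=2.5):
    """Return names of items that breach the guide criteria.

    items: list of (name, fit_residual, chi_square_p_value) tuples.
    The Bonferroni adjustment divides alpha by the number of items tested.
    """
    if not items:
        return []
    bonferroni_alpha = alpha / len(items)
    flagged = []
    for name, residual, p_value in items:
        residual_out_of_range = abs(residual) > residual_limit
        chi_square_significant = p_value < bonferroni_alpha
        if residual_out_of_range or chi_square_significant:
            flagged.append(name)
    return flagged


# Hypothetical fit statistics for three items (not study data).
example_items = [
    ("item1", 0.4, 0.30),    # within range, nonsignificant
    ("item2", 3.1, 0.20),    # residual above +2.5
    ("item3", -1.0, 0.001),  # p below 0.05 / 3
]
misfitting = flag_misfit(example_items)
```

A flagged item is a prompt for closer inspection (for example, of its item characteristic curve) and is not an automatic reason to delete it.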


Each of the items of the Satisfaction with Facial Appearance scale has multiple response categories (ie, very dissatisfied, somewhat dissatisfied, somewhat satisfied, and very satisfied), which reflect an ordered continuum. Although this ordering may seem clinically sensible at the item level, it must also work when the items are combined to form a set. Item fit validity analysis tests this statistically and graphically by threshold locations and plots. The threshold values between adjacent pairs of response options for each item are expected to be ordered by magnitude (less to more). Thresholds are visible in graphical plots, in which the probability curve for each response category should, somewhere along the continuum, rise above the curves for its adjacent categories. When response options work as expected, important evidence for the validity of the scale is obtained.
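The ordering requirement for thresholds can be expressed as a simple check. This is a minimal sketch with hypothetical threshold locations (in logits); with 4 response options each item has 3 thresholds, and the function name is an assumption for illustration.

```python
def thresholds_ordered(thresholds):
    """True when each threshold is strictly greater than the one before it."""
    return all(lower < upper for lower, upper in zip(thresholds, thresholds[1:]))


# Hypothetical threshold locations for two items (not study data).
ordered_item = [-1.8, 0.1, 2.0]      # less to more: ordered
disordered_item = [-0.5, -0.9, 1.7]  # second threshold below the first

ordered_ok = thresholds_ordered(ordered_item)
disordered_ok = thresholds_ordered(disordered_item)
```

Disordered thresholds suggest that patients are not using the response options in the intended order for that item.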


Targeting


Scale-to-sample targeting concerns the match between the range of satisfaction with facial appearance measured by the Satisfaction with Facial Appearance items and the range of satisfaction with facial appearance reported by a sample of patients. Targeting can be assessed by examining the spread of person locations (ie, the distribution of transformed total scores) relative to the spread of item locations across the continuum of satisfaction with facial appearance. Targeting analysis indicates how suitable the sample is for evaluating the Satisfaction with Facial Appearance scale and how suitable the scale is for measuring the sample. Better targeting equates to a better ability to interpret the psychometric data with confidence.


Reliability


Person measurements (estimates) are examined with the Person Separation Index (PSI), a reliability statistic that is comparable with the Cronbach alpha. The PSI quantifies the error associated with the measurements of people in a sample. Higher PSI values indicate better reliability (>0.70 indicates adequate reliability).
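One common formulation of the PSI is the proportion of variance in person estimates that is not attributable to measurement error. The sketch below follows that formulation; the function name, the use of the sample variance, and the example values are assumptions, and RUMM2030 may differ in computational detail.

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """PSI = (variance of person estimates - mean squared SE) / variance of person estimates."""
    observed_variance = variance(estimates)  # sample variance (assumption)
    error_variance = mean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance


# Hypothetical person locations (logits) and their standard errors.
person_estimates = [-1.0, 0.0, 1.0, 2.0]
person_ses = [0.5, 0.5, 0.5, 0.5]
psi = person_separation_index(person_estimates, person_ses)
```

With these illustrative values the index is 0.85, which would exceed the 0.70 guide value for adequate reliability.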


Stability


Scale performance (specifically item performance) should be stable across clinically important subgroups, with no systematic between-group differences that bias responses to items. Stability analysis enables an explicit test of scale performance in the form of an examination of differential item functioning (DIF). We examined DIF for gender, age, and ethnicity. As a guide, statistically significant chi-square values indicate potential DIF and therefore problems in scale performance (significance interpreted after Bonferroni adjustment).


Traditional psychometric methods analysis


Traditional psychometric methods primarily use correlation or descriptive analyses to evaluate scaling assumptions (legitimacy of summing items) and scale reliability and validity, which are described in detail elsewhere. We examined data quality (percent missing data for each item), scaling assumptions (similarity of item means and variances; magnitude and similarity of corrected item-total correlations), scale-to-sample targeting (score means; standard deviation [SD]; floor and ceiling effects), and internal consistency reliability (Cronbach alpha, homogeneity coefficients).
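Of the traditional statistics listed above, the Cronbach alpha is the most easily sketched. The example below uses the standard formula with sample variances; the function name and the tiny response matrix are hypothetical and serve only to show the calculation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach alpha for a matrix of responses.

    responses: list of rows, one row per respondent, one column per item.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    n_items = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from 3 people to 2 items (not study data).
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

Alpha approaches 1 as items covary more strongly; it needs at least 2 items and some variation in total scores to be defined.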


Responsiveness analysis


The responsiveness of the Satisfaction with Facial Appearance scale to clinical change was examined at the group level in the largest subgroup in our sample (patients having facelifts) by comparing pretreatment and posttreatment Rasch transformed scores using paired t-tests and calculating the following 2 standard indicators: effect size (ES; Kazis ES) and standardized response mean (SRM). Larger ESs/SRMs indicate greater responsiveness, and it is standard practice to interpret the magnitude of the change using Cohen’s arbitrary criteria (0.20, small; 0.50, moderate; and 0.80, large). Preliminary minimal importance difference (MID) values were generated by (1) calculating half the standard deviation of the pretreatment scores, and (2) extrapolating a change score based on an ES of 0.5.
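The group-level indicators can be sketched as follows, using the usual definitions: the Kazis ES divides the mean change by the SD of pretreatment scores, and the SRM divides the mean change by the SD of the change scores. The function name and the example scores are hypothetical, and only the half-SD approach to the MID is shown.

```python
from statistics import mean, stdev


def responsiveness(pre_scores, post_scores):
    """Group-level responsiveness indicators for paired pre/post scores."""
    changes = [post - pre for pre, post in zip(pre_scores, post_scores)]
    mean_change = mean(changes)
    return {
        "mean_change": mean_change,
        "es": mean_change / stdev(pre_scores),  # Kazis effect size
        "srm": mean_change / stdev(changes),    # standardized response mean
        "mid_half_sd": 0.5 * stdev(pre_scores), # half-SD estimate of the MID
    }


# Hypothetical Rasch transformed scores (0 to 100) for 3 patients.
result = responsiveness([40, 50, 60], [60, 65, 85])
```

Both ES and SRM are unitless, so they can be read against Cohen’s 0.20, 0.50, and 0.80 benchmarks.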


The responsiveness of the Satisfaction with Facial Appearance scale was also examined at the individual person level by computing, for each person, the significance of their own change in measurement (sig change). First, we computed a change score for each person (before surgery to after surgery). Second, we computed the standard error associated with each person’s change score (ie, the square root of the sum of the squared standard error values before and after surgery). Third, we computed the significance of the change for each person by dividing their change score by the standard error of the difference (SE diff; ie, how large their change was in standard error units). Fourth, we categorized the significance of each person’s change score into 1 of 5 groups according to the size and direction of the change score. We then counted the number of people achieving each level of significance of change. The formulae are as follows:


$$\text{Sig change} = \frac{\text{Postsurgery transformed score} - \text{Presurgery transformed score}}{\text{SE}_{\text{diff}}}$$
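The four steps described above can be sketched in a few lines. The standard error of the difference and the sig change ratio follow the text directly; the ±1.96 cutoff and the five category labels are assumptions chosen for illustration, because the cutoffs are not stated in this section. The patient values are hypothetical.

```python
import math
from collections import Counter


def sig_change(pre, post, se_pre, se_post):
    """Change score divided by the standard error of the difference."""
    se_diff = math.sqrt(se_pre ** 2 + se_post ** 2)
    return (post - pre) / se_diff


def categorize(sig, cutoff=1.96):
    """Assign 1 of 5 groups by size and direction (cutoff is an assumption)."""
    if sig >= cutoff:
        return "significant improvement"
    if sig > 0:
        return "nonsignificant improvement"
    if sig == 0:
        return "no change"
    if sig > -cutoff:
        return "nonsignificant worsening"
    return "significant worsening"


# Hypothetical patients: (pre score, post score, SE before, SE after).
patients = [
    (50.0, 65.0, 3.0, 4.0),
    (60.0, 62.0, 3.0, 4.0),
    (55.0, 55.0, 3.0, 4.0),
    (70.0, 58.0, 3.0, 4.0),
]
counts = Counter(categorize(sig_change(*p)) for p in patients)
```

Counting the categories gives the number of people at each level of significance of change, which is the tally described in the text.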


Nov 20, 2017 | Posted by in General Surgery | Comments Off on Development and Psychometric Evaluation of the FACE-Q Satisfaction with Appearance Scale
