Structural biology of SARS-CoV-2 and implications for therapeutic development

Haitao Yang & Zihe Rao Nature Reviews Microbiology volume 19, pages685–700 (2021) Cite this article

Abstract

The COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is an unprecedented global health crisis. However, therapeutic options for treatment are still very limited. The development of drugs that target vital proteins in the viral life cycle is a feasible approach for treating COVID-19. Belonging to the subfamily Orthocoronavirinae with the largest RNA genome, SARS-CoV-2 encodes a total of 29 proteins. These non-structural, structural and accessory proteins participate in entry into host cells, genome replication and transcription, and viral assembly and release. SARS-CoV-2 proteins can individually perform essential physiological roles, be components of the viral replication machinery or interact with numerous host cellular factors. In this Review, we delineate the structural features of SARS-CoV-2 from the whole viral particle to the individual viral proteins and discuss their functions as well as their potential as targets for therapeutic interventions.

Introduction

Coronaviruses are enveloped viruses that possess a positive-sense single-stranded RNA genome 26–32 kb in length¹. Coronaviruses belong to the Coronaviridae subfamily Orthocoronavirinae. According to variations in the genome sequence and serological reactions, coronavirus members in the subfamily are classified into four genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus and Deltacoronavirus². Among them, Betacoronavirus is classified into five subgenera. Although infectious bronchitis virus was the first coronavirus isolated in chicken embryos in 1937 (ref.³), it was not until the 1960s that these viruses, particularly the human respiratory coronaviruses⁴, were characterized by electron microscopy. This subfamily of viruses has a unique structural feature on their surfaces which resembles a solar corona. This feature arises due to the presence of spike proteins on the virion surface.

Coronaviruses are characterized by high genetic recombination and mutation rates, which result in their ecological diversity⁵. They are able to infect and readily adapt to a wide range of hosts, from birds to whales. Seven coronaviruses have been found to infect humans. Human coronaviruses 229E, OC43, NL63 and HKU1 are responsible for 10–30% of upper respiratory tract infections annually, characterized by mild respiratory illnesses, such as the common cold⁶. By contrast, severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus⁷ and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are able to cause severe human respiratory diseases, potentially resulting in high mortality. In 2002–2003, SARS-CoV resulted in 8,096 reported cases and 774 deaths (case–fatality rate of ~10%)⁷. By the end of January 2020, 2,500 cases of Middle East respiratory syndrome and more than 800 associated deaths (case–fatality rate ~34%) were reported worldwide⁸. In late December 2019, clustered cases of a severe pneumonia were reported, and the aetiological agent was isolated and identified as a novel betacoronavirus, named SARS-CoV-2, that shares ~80% similarity in genome sequence with SARS-CoV⁹. SARS-CoV-2 causes COVID-19, with symptoms including fever, cough, fatigue, nausea and shortness of breath¹⁰. To date, there have been more than 160 million confirmed COVID-19 cases and more than 3 million related deaths worldwide¹¹.

To date, there has been a lack of effective therapies to treat COVID-19. Due to the rampant and continuous spread of COVID-19, it is a matter of urgency to identify and characterize drug and vaccine targets for SARS-CoV-2. The genome of SARS-CoV-2 is close to 30 kb on size, contains 14 open reading frames (ORFs) and encodes 29 viral proteins. Approximately two thirds of the 5′ end of the SARS-CoV-2 genome encodes two overlapping polyproteins: pp1a and pp1ab¹². These two polyproteins are digested by two viral proteases into 16 non-structural proteins (NSPs), which are essential for viral replication and transcription (Fig. 1a). Four ORFs at the 3′ terminus of the viral genome encode a canonical set of structural proteins that include the nucleocapsid (N), spike (S) protein, membrane (M) protein and envelope (E) protein, which are responsible for virion assembly and also participate in suppression of the host immune response. A series of accessory genes, which encode accessory proteins (ORF3a, ORF3b, ORF6, ORF7a, ORF7b, ORF8b, ORF9b and ORF14), lie between these structural genes. The accessory proteins are involved in regulating viral infection but may not be incorporated into the virion, except for the structural proteins ORF3a and ORF7a.

**Fig. 1: SARS-CoV-2 genome and life cycle.**

Briefly, in the first step of the SARS-CoV-2 life cycle, the S protein on the outer surface of the virion is responsible for binding to the host receptor or receptors for attachment to the cell membrane, which is followed by viral and host cellular membrane fusion and the release of viral genomic RNA into the cells. Subsequently, host ribosomes are hijacked to produce the two viral replicase polyproteins, which can further be processed into 16 mature NSPs through two virus-encoding proteases: main protease (M^pro) and papain-like protease (PL^pro). These NSPs are able to assemble into the replication and transcription complex (RTC) to initiate viral RNA replication and transcription. The genomic RNA and structural proteins then assemble into mature progeny virions, which are subsequently released through exocytosis to initiate another round of infection¹⁰ (Fig. 1b). Viral proteins can individually perform important physiological roles, constitute the viral protein machinery for specific essential events in the viral life cycle or extensively interplay with the cellular factors in the host immune response and pathogenesis¹³. In the following sections, we delineate the structural features of SARS-CoV-2 extending from the whole viral particle to individual proteins, including several antiviral drug targets, including the S protein, PL^pro, M^pro and viral RNA-dependent RNA polymerase (RdRP)¹⁴.

Structural proteins in the viral life cycle

S protein in viral entry

The S protein is a homotrimer, which protrudes from the virion and extensively decorates the viral surface like a crown. It is heavily glycosylated, belongs to the type I membrane-protein family and is anchored in the viral membrane, where it mediates fusion of the viral membrane with the host cell membrane¹⁵. In the native state, prefusion and postfusion conformations of S proteins can be traced simultaneously on the reconstructed virions. The SARS-CoV-2 S protein comprises ~1,200 residues and can be cleaved by a furin-like protease into two functional subunits, S1 and S2, which are responsible for mediating attachment to host cells and membrane fusion, respectively¹⁶. After cleavage during viral entry into the host cells, S1 and S2 remain associated with each other through non-covalent interactions. As shown by cryogenic electron microscopy (cryo-EM) (Fig. 2a), the S1 subunit of the SARS-CoV-2 S protein wraps around a threefold axis, covering the S2 subunit underneath¹⁷. The S1 subunit contains a receptor-binding domain (RBD) and an amino-terminal (N-terminal) domain (NTD). The RBD has a five-stranded antiparallel β-sheet core, flanked on either side by a short helix. The receptor-binding motif (RBM) extends out of the core (connecting β4 and β5), taking on a cradle-like structure for receptor binding. The RBM, which is stabilized by a disulfide bond, does not possess a regular secondary structure except for two small β-sheets. The RBD can adopt two distinct conformational states: the closed ‘down’ state and the open ‘up’ state¹⁷. In the ‘down’ state, RBD angles are close to the central cavity of the trimer to shield the receptor-binding regions, while in the ‘up’ state, the RBD undergoes hinge-like conformational movement, exposing its determinant regions to recognize the human angiotensin-converting enzyme 2 (hACE2) receptor on the host cellular membrane, the state of which is considered to be less stable than in the ‘down’ state. The NTD of the S protein adopts a galectin-like fold with a sugar-binding pocket and contains a ceiling-like structure on top. The NTD may recognize sugar moieties upon initial attachment and play a significant role in the transition of the conformation of the S protein. The S2 subunit comprises four conserved structural regions: a fusion peptide, two heptad repeats (HR1 and HR2) and a transmembrane region. The HR1 region constitutes the main helical stalk of S2, whereas the HR2 region is temporarily flexible in the prefusion state. The fusion peptide forms a short hydrophobic segment.

**Fig. 2: Structures of the SARS-CoV-2 spike protein in the presence or absence of antibodies.**

Undergoing a substantial structural rearrangement, from the metastable prefusion conformation to the postfusion conformation, the S protein fulfils its function in regulating the fusion of viral membrane with the host cell membrane¹⁸. Fusion is triggered when the S1 subunit binds to hACE2 (Fig. 2b,c). As observed in the complex structure, the N-terminal helix of hACE2 interacts with the outer surface of the RBM in the S1 subunit^19,20,21,22. The interaction involves 16 residues in the RBD and 20 residues in hACE2, which forms a network consisting of 14 hydrogen bonds and one salt bridge¹⁹. The binding of hACE2 to the RBD can lock the RBD in the ‘up’ conformation and trigger S1 shedding, which is mediated by the proteolytic cleavage of host TMPRSS2 and cathepsin B or cathepsin L. Thus, three HR1 helices of trimeric S2 interact with the pairing HR2 helices and constitute a stable six-helix bundle²³. In this unique helix bundle, three HR2 helices are packed into the hydrophobic grooves of the HR1-trimer core in an antiparallel manner. This conformational arrangement brings viral and host cell membranes into proximity and facilitates subsequent membrane fusion. Because of the indispensable function of the S protein, it is an attractive target for inhibition by neutralizing antibodies (nAbs), and characterization of the S protein structure provides atomic-level information for rational vaccine design.

S protein-neutralizing antibodies

nAbs targeting the SARS-CoV-2 S trimer have shown protection from viral infection in animal models and are being evaluated as therapeutics in humans. These antibodies comprise human monoclonal antibodies isolated from COVID-19 convalescent donors and single-domain antibodies (also known as nanobodies) which can bind novel epitopes, including buried cavities that are inaccessible to conventional antibodies. Determination of a number of structures of nAbs in complex with the S trimer has elucidated their modes of neutralization. Although some nAbs target the NTD or S2, most nAbs bind to the RBD, the latter of which can be further classified into four distinct classes (classes I, II, III and IV) on the basis of the nAb–RBD binding characteristics.

The nAbs in class I can bind to the RBD only in the ‘up’ state (Fig. 2e). They are expected to bind to the flat area on the top side of the cradle-like surface of the RBD, which extensively overlaps with the binding site for hACE2. Through direct competition with hACE2, nAbs in this class would produce steric hindrance when binding to RBD, blocking hACE2 attachment. CB6 (ref.²⁴), C105 (ref.²⁵), CV30 (ref.²⁶), B38 (ref.²⁷), CC12.1, CC12.3 (ref.²⁸), PR1077 (ref.²⁹) and P4A1 (ref.³⁰) nAbs belong to this class. Most contain IGHV3-53– or IGHV3-66-encoded heavy chains and utilize residues in complementarity-determining regions 1, 2 and 3.

The nAbs in class II also bind to the RBD in the ‘up’ state, but exhibit no overlap with hACE2-binding sites (Fig. 2f,g). CR3022 (ref.³¹), EY6A³² and nanobody VHH-72 (ref.³³) belong to this class. The binding region is located at the bottom of the RBD, and is spatially separated from the hACE2-binding sites. Structural analysis showed that the RBD undergoes a rotation that exposes the epitopes for these nAbs. Such a rearrangement is considered to cause a premature conversion of the S protein from the prefusion state to the postfusion state. The resulting unstable configuration of the S protein consequently inactivates SARS-CoV-2.

The nAbs in class III can bind to RBDs only in the ‘down’ conformation (Fig. 2h). They comprise Fab 2-4, Fab 2-43 (ref.³⁴) and BD23 (ref.³⁵). The heavy chains of the nAbs reach the RBD and interact with the cradle-like surface or the flexible ridge region. However, the binding pattern between these nAbs and the RBD is different from that for class I nAbs, according to the orientation change in the RBD, and the binding area becomes narrower. Notably, N-glycan chains are supposed to play a significant role in stabilizing the binding of class III nAbs to the ‘down’ RBD. Additionally, epitopes of some nAbs extend to the NTD, which may help to resist dynamic instability. Collectively, this binding mode would lock the RBD in the ‘down’ conformation, which also sterically hinders hACE2 access.

The nAbs in class IV can recognize both the ‘up’ RBD conformation and the ‘down’ RBD conformation (Fig. 2i,j). They comprise H11-D4, H11-H4 (ref.³⁶), P2B-2F6 (ref.³⁷), Ty1 (ref.³⁸), S309 (ref.³⁹), REGN10987 (ref.⁴⁰) and P17 (ref.⁴¹). Structural studies show that these nAbs target different regions. P2B-2F6 and the nanobodies H11-D4 and H11-H4 can bind to the top cradle-like surface in a similar orientation as class III nAbs. Their binding can be further reinforced by a protruding loop on the RBD. These three nAb epitopes are largely located on the opposite side of the RBM compared with the epitopes of class I nAbs. By partially overlapping with the hACE2-binding site, these nAbs sterically block hACE2 binding to the RBD as well. S309 targets a region distinct from the RBM. Its epitope comprises the α1 helix, a section of the β1 strand and two loops formed by residues 358–361 and 333–335. RGEN10987 is another class VI nAb that binds distal to the hACE2-binding site. The binding of this nAb would spatially hinder hACE2 attachment.

4A8 (ref.⁴²), COV57 (ref.²⁵), 2–17, 5–24, 4–8 (ref.³⁴) and FC05 (ref.⁴³) are nAbs that target other parts of the S protein. Structural analysis reveals that 4A8, which shows a high level of neutralization of SARS-CoV-2, recognizes the NTD and does not sterically hinder the binding between hACE2 and the S protein (Fig. 2d). Regarding the S2 subunit, only a few targeted monoclonal antibodies have been reported. Antibody 1A9 (ref.⁴⁴) has been found to interact with the S2 subunit but fails to neutralize SARS-CoV-2. In a recent report, the nAb CC40.8 was identified and found to neutralize SARS-CoV-2 and specifically recognize the S2 subunit⁴⁵. The discovery of non-RBD-targeted nAbs may benefit the strategy of nAb cocktail therapeutics.

Since SARS-CoV and SARS-CoV-2 share the same host cell receptor, hACE2, development of cross-neutralizing antibodies to both coronaviruses seems feasible. H014 (ref.⁴⁶) is a recently reported humanized antibody which efficiently neutralizes both SARS-CoV and SARS-CoV-2. It can recognize and interact with open RBDs, but the binding interface is located distinct from the RBM, and exhibits no competition with hACE2 attachment. Consistently, other cross-neutralizing antibodies (for example, VHH-72, ADI-56046 (ref.⁴⁷), COV21 (ref.²⁵) and CC6.33 (ref.⁴⁸)) also avoid the RBM and prefer to recognize the core domain of the RBD.

It is noteworthy that SARS-CoV-2 has a high mutation rate, and numerous mutant strains (variants) have been reported. Mutations in the S protein, especially the epitopes for nAbs, would attenuate the potency of nAbs. The D614G mutation is the most commonly reported mutation in the S protein⁴⁹, and results in increased infectivity and morbidity. The cryo-EM structure of the trimeric S protein with D614G demonstrated a conformational shift towards the hACE2-binding fusion-competent state⁴⁹ and exhibited attenuation of efficacy in nAb binding. N501Y is a mutant variant emerging from the United Kingdom, South Africa and Brazil⁵⁰. The mutation site is located at the RBD–hACE2 interface and has been experimentally shown to cause an increase in hACE2 affinity⁵¹. Other mutations worth noting include K417N and K417T, which appear in the epitopes of class I nAbs and are considered to affect the binding of class I antibodies. Mutations at residues in the NTD were also found in the new variants of concern, such as ΔY144 and Δ242–244. They were shown to abrogate neutralization of NTD-specific nAbs^52,53,54. Additionally, SARS-CoV-2 with the naturally occurring mutations to E484, F490, Q493 or S494 of the S protein was found to escape from potential therapeutic antibodies such as C121 and C144 (ref.⁵⁵). Combination treatment with two or more nAbs targeting distinct epitopes would be a strategy to suppress nAb escape variants.

E protein

After a coronavirus enters host cells, the E protein regulates viral lysis and the subsequent viral genome release. The E protein was found to be involved in viral assembly and budding by localizing to endoplasmic reticulum (ER) and Golgi body membranes². Moreover, the E protein has been shown to participate in activating the host inflammasome⁵⁶.

The structure of the SARS-CoV-2 E protein⁵⁷ solved by nuclear magnetic resonance spectroscopy shows that it is composed of a five-helix bundle ~35 Å in length (Fig. 3b). As the E protein can function as an ion channel, the pore inside the transmembrane region is predominantly occupied by hydrophobic residues except for the N-terminal pore. Owing to non-specific interhelical interactions, the entrance site at the N terminus is a drug target for inhibitor binding. The E protein is recognized topologically to be N_lumen–C_cyto (N-terminal ER–Golgi intermediate compartment lumen and carboxy-terminal (C-terminal) cytoplasm) and involved in regulation of pumping Ca²⁺ out of the ER, which may lead to activation of the cellular inflammasome, thereby enhancing the host antiviral response.

**Fig. 3: Structures of the SARS-CoV-2 nucleocapsid and envelope proteins.**

N protein

The N protein serves as the only structural protein inside the virion. It is a crucial component that protects the viral RNA genome and packages it into a ribonucleoprotein complex. A native reconstruction of SARS-CoV-2 using electron cryotomography suggests that a significant number of ribonucleoproteins may be membrane proximal. The N protein also plays a role in antagonizing the host immune response⁵⁸ and has been identified to counter cellular RNAi-mediated antiviral activities through its binding with double-stranded RNA ‘strings’⁵⁹, and can be regarded as a viral suppressor of RNA silencing. The N protein has potential as a target for vaccine development because it induces a severe immune responses during infection.

The N protein has two conserved structural domains, the NTD (N-NTD) and the CTD (N-CTD), each of which is independently folded⁶⁰. In the crystal structures of the N protein⁶¹, the N‐NTD exists as a monomer, whereas the N‐CTD exists as a dimer (Fig. 3a). The N‐NTD has the shape of a right‐handed fist and contains a four‐stranded antiparallel β‐sheet as a core subdomain. The loops protruding out of the core are positively charged, putatively to allow RNA binding. The N‐CTD homodimer forms a rectangular shape, with each protomer displaying a crescent shape. To stabilize the dimer interface, two β‐hairpin structures from each protomer can form four antiparallel β-strands by inserting themselves into each cavity. Compared with other coronaviruses, the N protein from SARS-CoV-2 displays different charge distributions in the N-terminal loop, the RNA protruding tip, the bottom of the N-NTD core and the N-CTD β-strand face. Hence, the variations in RNA binding to the N protein may further guide inhibitor optimization.

NSPs and inhibitors

Host translation shutdown by nsp1

nsp1 originates from the N-terminal cleavage of polypeptides pp1a and pp1ab by PL^pro. The biological functions of nsp1 manifest themselves mainly in virus–host interactions to suppress host translation^62,63, and thus nsp1 can be regarded as a canonical virulence factor. To hinder the host translation process, nsp1 is proposed to function by two mechanisms: the first is to bind the ribosomal 40S subunit during the initiation stage⁶⁴ and the second is to induce host mRNA degradation⁶⁵. Importantly, nsp1 does not impede viral protein expression while it binds to the mRNA 5′ untranslated region, leading to efficient viral translation and replication. The structure of nsp1 and the ribosomal 40S subunit has been determined to show the interactions between them and to explain the potential inhibition mechanism⁶⁶. In this cryo-EM structure, the C-terminal domain of nsp1 possesses a short α-helix which is connected to a longer α-helix through a short loop (Fig. 4a). Thus, the host mRNA entry channel is blocked by nsp1 insertion. This hypothesis is corroborated by the loss of host translation inhibition in the K164A–H165A double mutant. The long α-helix also contributes to the interactions between nsp1 and the ribosome. Through the shutdown of host translation, especially antiviral factors, nsp1 assists in evading immune defences, which suggests that disrupting nsp1–ribosome interactions is a plausible approach for SARS-CoV-2 drug discovery.

**Fig. 4: Structures of the SARS-CoV-2 nsp1 and nsp3 subdomains and PL^pro inhibitors.**

Multidomain protein nsp3

nsp3 consists of 10–16 domains depending on the coronavirus genus. Eight are present in all coronaviruses, including ubiquitin-like domain 1 (Ubl1), a hypervariable region, a macrodomain, ubiquitin-like domain 2 (Ubl2), a PL^pro, a zinc-finger domain, a Y1 domain and a CoV-Y domain⁶⁷. Most of the conserved domains perform essential functions in the life cycle of the virus. The macrodomains possess highly conserved structures and similar functions. Macrodomain Mac1 can cleave the phosphate group of ADP-ribose 1-phosphate and reverse protein ADP-ribosylation by hydrolysis. The core structure of Mac1 contains seven β-strands flanked by six α-helices (Fig. 4b). ADP-ribose interacts with the Mac1 hydrophobic cleft through conserved hydrogen bonds⁶⁸. This indicates that compounds targeting Mac1 may have broad-spectrum antiviral activities.

The ‘SARS-unique domain’ (SUD) participates in virus–host interactions. SUD has three subdomains: SUD-N (Mac2), SUD-M (Mac3) and SUD-C (DPUP). SUD-N and SUD-M adopt a macrodomain fold, whereas SUD-C has a frataxin‐like fold. Deletion of Mac2 decreases the viral replication rate to 65–70%, whereas Mac3 is indispensable for replication activity⁶⁹. PAIP1, which is a component of the eukaryotic translation machinery, has been identified to interact with SUD. The structure of the Mac2–PAIP1M (middle domain of PAIP1) complex shows that Mac2 displays a typical α/β/α macrodomain fold, whereas PAIP1M adopts a HEAT repeat fold⁷⁰. Strong complementarity which enhances complex stability is observed at the interface. This structure also supports the suggestion that Mac2–PAIP1M participates in regulating viral mRNA translation and is thus a good antiviral drug target.

PL^pro is located in nsp3 between SUD and a nucleic acid-binding domain. It cleaves the viral polyprotein precursors pp1a and pp1ab at three sites to produce NSPs nsp1, nsp2 and nsp3 (ref.⁷¹). Apart from viral polyproteins, PL^pro can also cleave host proteins to antagonize the innate immune response⁷². It preferentially recognizes and cleaves interferon-stimulated gene product 15 (ISG15) from interferon regulatory factor 3 (IRF3) and attenuates type I interferon responses, facilitating escape of the virus from the immune system⁷³. PL^pro is a 36-kDa cysteine protease with a catalytic triad⁷¹. It contains an N-terminal ubiquitin-like domain and a catalytic core domain⁷⁴. The catalytic core domain comprises three subdomains, the thumb, palm and fingers, which together fold like an open right hand. The thumb subdomain is composed of four α-helices, whereas the palm is formed by a six-stranded β-sheet. A four-stranded, twisted, antiparallel β-sheet makes up the finger subdomain. In the fingertip region, four cysteine residues constitute a zinc-finger motif, which coordinates a zinc ion with tetrahedral geometry. This zinc-finger is essential for structural integrity and protease activity.

The substrate-binding site is located in the solvent-exposed cleft between the thumb subdomain and the palm subdomain, which possess a catalytic triad composed of C111, H272 and D286. The substrate-binding site recognizes the consensus sequence LXGG↓X (the amino acid residues of the substrate are numbered P4–P3–P2–P1↓P1′–P2′ around the cleavage site, denoted by the downwards arrow). Subsites S1–S4 provide the binding sites for P1–P4, respectively⁷⁵. The S1 and S2 subsites are rather narrow, and can accommodate only glycine residues. The S3 subsite is partially solvent exposed but prefers positively charged and hydrophobic residues. The S4 subsite is relatively large and accommodates only hydrophobic residues. A flexible β-hairpin BL2 loop, which contains an unusual β-turn at Y268 and Q269, is involved in controlling substrate access to the active site. Consideration of the conformation of the BL2 loop may be important for rational drug design.

Besides the catalytic site, PL^pro harbors two distinct binding subsites (SUb1 and SUb2) for recognizing diubiquitin chains and ISG15. SUb1 recognizes one ubiquitin molecule of diubiquitin chains and the C-terminal ubiquitin-like domain of ISG15. SUb2 recognizes the other (K48-linked) uniquitin molecule and the N-terminal ubiquitin-like domain of ISG15 (refs^74,76) (Fig. 4d). As shown in the complex structures of PL^pro–ubiquitin and PL^pro–ISG15, SUb1 of SARS-CoV-2 PL^pro preferentially binds ISG15 through a different binding mode compared with uniquitin. Moreover, PL^pro SUb2 provides exquisite specificity for K48-linked diubiquitin chains, which makes diubiquitin a suitable substrate compared with monoubiquitin.

Inhibitors targeting PL^pro

Owing to the substantial role in mediating viral replication and suppressing the host immune response, PL^pro is an attractive target for antiviral drug development. Thousands of compounds, including approved drugs and molecules in clinical trials, have been screened against this target, but the hit rate is extremely low compared with that of drug leads that target M^pro, another viral protease encoded by SARS-CoV-2. The peptidomimetic inhibitors VIR250 and VIR251 were the first identified covalent inhibitors of PL^pro (ref.⁷⁷) (Fig. 4c). A catalytic residue, C111, of PL^pro engages in a Michael addition reaction with the β-carbon of the vinyl group of the vinylmethyl ester warheads from VIR250 and VIR251, resulting in the formation of a covalent thioether linkage. Residues at P2, P3 and P4 participate in an extensive network of hydrogen bonds and van der Waals interactions with their corresponding subsites. Similar substrate preferences and catalytic efficiencies are observed for SARS-CoV-2 and SARS-CoV PL^pro, suggesting that inhibitors of SARS-CoV PL^pro are a good starting point for lead compound optimization against SARS-CoV-2. GRL0617, an inhibitor of SARS-CoV PL^pro, also inhibits SARS-CoV-2 PL^pro (ref.⁷⁸). Structural studies show that GRL0617 fits in the substrate cleft which was formed between the BL2 loop and the loop connecting α3 and α4, where it occupies the S3 and S4 subsites. The aromatic ring of GRL0617 fits into the S3 subsite, while the naphthalene group fills the S4 subsite. Thus, the binding of GRL0617 blocks the substrate from gaining access to the active site. Inspired by the success of GRL0617, several naphthalene-based compounds were synthesized and also show good inhibition of SARS-CoV-2 PL^pro (ref.⁷⁹). YM155, an anticancer drug candidate in clinical trials, has also been shown to inhibit SARS-CoV-2 PL^pro and has potent antiviral activity (half-maximal effective concentration (EC₅₀) of 170 nM)⁸⁰. YM155 achieves such a strong inhibition by simultaneously recognizing three hotspots in PL^pro. The first binding site is located at the entrance of the substrate-binding pocket and blocks substrate entry to the active site. The second is located on the thumb domain and hampers interactions between PL^pro and ISG15. The third site is located on the zinc-finger motif, and the binding perturbs the stability of the zinc-finger motif and enzyme activity.

M^pro

M^pro is the major protease encoded by SARS-CoV-2. It cleaves replicase polyproteins at no fewer than 11 sites to release NSPs, allowing the assembly of the viral replication and transcription machinery. The pivotal role that M^pro plays in regulating viral replication and transcription makes it an attractive drug target. Crystal structures show that this 306 amino acid protease comprises three domains (domain I, residues 10–99; domain II, residues 100–182; and domain III, residues 198–303) and adopts a chymotrypsin-like fold⁸¹. Due to the similar substrate specificity and presence of a cysteine as a catalytic residue, M^pro is classified as a 3C-like protease⁸².

Since the first crystal structure of SARS-CoV-2 M^pro in complex with a Michael acceptor inhibitor N3 (Protein Data Bank accession code 6LU7) (Fig. 5a) was published⁸¹, many structures of M^pro in complex with inhibitors have been reported. SARS-CoV-2 M^pro functions as an active homodimer, in which the two protomers are nearly perpendicular to each other. The N-terminal finger (residues 1–7) of one protomer inserts itself between domains II and III of its neighboring protomer, and promotes the formation of the dimer and the S1 subsite in the neighboring protomer⁸³. Dimerization is additionally regulated by domain III through a salt-bridge interaction between E290 of one protomer and R4 from its adjacent protomer. In each protomer, a deep cleft between domains I and II forms the substrate-binding site, with a catalytic dyad (H41 and C145) at its Centre. Domain III contains five α-helices that arrange themselves into a large antiparallel globular cluster and exhibit a unique topology in coronaviruses. Domains II and III are connected by a long loop (residues 183–198).

**Fig. 5: Structures of SARS-CoV-2 M^pro and its inhibitors.**

Coronavirus M^pros recognize the P4–P1′ positions of the substrate^84,85,86 (Fig. 5b). The S1 subsite has an absolute preference for glutamine at P1. P2 is usually a bulky side chain that can be accommodated by the deep hydrophobic S2 subsite. The P3 side chain is solvent exposed, and the corresponding S3 subsite also shows tolerance to a wide range of functional groups. The hydrophobic S4 subsite is smaller than S2 and thus accommodates residues with small side chains. This binding pocket is highly conserved among coronavirus M^pros, suggesting that antiviral inhibitors targeting this pocket should have broad-spectrum activity against coronaviruses in general⁸⁷.

Inhibitors of SARS-CoV-2 M^pro

Recently, numerous inhibitors of M^pro have been identified exhibiting a range of binding mechanisms (Fig. 5c). N3 is the representative peptidomimetic inhibitor, and harbours a Michael acceptor as a warhead and substituents spanning all substrate-binding subsites. The Michael acceptor forms a covalent bond with the active site residue, C145. N3 bears a lactam ring, an aliphatic isobutyl group, an isopropyl group, a methyl group and an isoxazole as the side chain for the P1–P5 sites, respectively. The lactam ring, which replaces glutamine at the P1 site, exhibits favorable binding at the S1 subsite^81,88. Studies have shown that N3 displays strong inhibition of M^pros from different coronaviruses, and it could inhibit SARS-CoV-2 with EC₅₀ of 16.77 μM in a Vero cell-based assay. This value may not be truly representative of activity as it is not clear whether the high levels of expression of the efflux transporter P-glycoprotein in Vero cells affected the evaluation of its antiviral efficacy⁸⁸.

A recent study reported a series of α-ketoamides that inhibit SARS-CoV-2 M^pro (ref.⁸⁹). Distinct from the previously designed α-ketoamides, the P2–P3 amide bond is replaced with a pyridone ring, which increases the half-life in plasma. Replacement of the P2 cyclohexyl moiety with smaller cyclopropyl increases the antiviral activity against betacoronaviruses. Approved hepatitis C virus drugs, such as boceprevir, telaprevir and narlaprevir, are α-ketoamide inhibitors and also exhibit inhibition of SARS-CoV-2 M^pro. The ketone group undergoes a nucleophilic attack by the C145 thiolate to form a hemithioketal. Because boceprevir, telaprevir and narlaprevir are peptidomimetic inhibitors with similar structures, they form very similar interactions with the S1′–S4 subsites⁹⁰. Another ketone-based potent inhibitor was discovered in the hydroxymethylketone class⁹¹. One of the hydroxymethylketone derivatives demonstrated inhibition of SARS-CoV-2 M^pro and also possesses antiviral activity with EC₅₀ of 4.8 μM.

Another study presented two peptidomimetic aldehydes (named ‘11a’ and ‘11b’) which bear an indole moiety at the N terminus (P3 site) and an aldehyde warhead at the C terminus⁹². The complex structures show that the aldehyde groups covalently bind to C145 of the catalytic dyad to inhibit M^pro activity. Both inhibitors exhibited excellent inhibition of SARS-CoV-2 M^pro with half-maximal inhibitory concentrations of 0.053 μM and 0.040 μM, respectively. The inhibitors also exhibited strong anti-SARS-CoV-2 infection activity in Vero cell-based assays and good pharmacokinetic and toxicity properties. A recent study reported another series of aldehyde derivatives with EC₅₀ ranging from 7.6 to 748.5 nM in cell-based assays. In a transgenic mouse model of SARS-CoV-2 infection, oral or intraperitoneal treatment with two compounds, MI-09 or MI-30, significantly reduced lung viral loads and lung lesions. Both also displayed good pharmacokinetic properties and safety in rats⁹³. GC376, an inhibitor of feline infectious peritonitis virus in preclinical studies, has been found to efficaciously inhibit SARS-CoV-2 in Vero cells by targeting M^pro. It utilizes an aldehyde bisulfite to covalently bind to C145 (refs^94,95). Based on 11a, 11b and GC376, a number of aldehyde-based dipeptidyl and tripeptidyl inhibitors of M^pro were designed, and the organocatalyst-mediated protein aldol ligation to C145 of the protease occurs⁹⁶. A series of M^pro inhibitors that possess an aldehyde group for covalent inhibition have been reported⁹⁷. Among them, two compounds inhibited SARS-CoV-2 replication in cultured primary human airway epithelial cells.

The repurposing of approved drugs, drug candidates and pharmacologically active compounds provides an alternative approach to identify potential drug leads that could rapidly be approved as clinical treatments for COVID-19. Through high-throughput screening, one study identified multiple drug leads that target M^pro, including ebselen, disulfiram and carmofur⁸¹. Ebselen exhibited antiviral activity in a plaque-reduction assay (EC₅₀ = 4.67 μM). As an organoselenium compound, ebselen was previously investigated for treatment of bipolar disorders and hearing loss⁹⁸. It has been shown to have low cytotoxicity in humans in clinical trials⁹⁹. Ebselen has been approved by the US Food and Drug Administration to enter phase II clinical trials (NCT04484025 and NCT04483973) for COVID-19 treatment. Carmofur, which also exhibited antiviral activity in vitro, is a derivative of 5-fluorouracil. It is an approved antineoplastic agent, and has been investigated as a cancer treatment¹⁰⁰. As observed in the complex structure of M^pro and carmofur, the catalytic C145 residue is covalently bound to the carbonyl reactive group of carmofur and its fatty acid tail extends into the hydrophobic S2 subsite¹⁰¹. Such a novel inhibitory mode makes carmofur a good lead compound for rational drug design. GRL-1720 and 5h were also identified as covalent inhibitors targeting M^pro through high-throughput screening. Crystal structures show that both GRL-1720 and 5h form extensive interactions with C145 and other residues in the M^pro active site¹⁰².

A recent study performed large-scale fragment screening against M^pro by combining mass spectrometry and X-ray approaches¹⁰³. Seventy-one hits were identified to bind at the substrate-binding site, and three hits were found to bind near the dimer interface. These structures provide a starting point to design more elaborate and potent drug leads that target SARS-CoV-2 M^pro. Another study performed a high-throughput X-ray crystallographic screening of two drug repurposing libraries (the Fraunhofer IME Repurposing Collection and the Safe-in-Man library from Dompé Farmaceutici) against the SARS-CoV-2 M^pro (ref.¹⁰⁴); the study authors identified 37 compounds that bind to M^pro. In subsequent cell-based assays, one peptidomimetic compound (calpeptin) and six non-peptidic compounds showed antiviral activity at non-toxic concentrations. Additionally, two allosteric binding sites representing potential targets against SARS-CoV-2 were identified. The first allosteric site is in the immediate vicinity of the S1 pocket of the adjacent protomer within the native dimer. The second allosteric site is formed by the deep groove between the catalytic domain and the dimerization domain.

Baicalin and baicalein, which are natural products derived from the flowering plant Scutellaria baicalensis, have been shown to inhibit SARS-CoV-2 M^pro with half-maximal inhibitory concentrations of 6.41 μM and 0.94 μM, respectively¹⁰⁵. The structure of M^pro in complex with baicalein shows that the phenyl ring with three hydroxy groups forms π–S and π–π interactions with C145 and H41 of the catalytic dyad, while the hydroxy groups form multiple hydrogen bonds with the S1 subsite. The distal phenyl ring occupied the S2 subsite. Another example is shikonin¹⁰⁶. The complex structure shows that shikonin forms a hydrogen bond network with the catalytic dyad C145 and H164 located in the S1 subsite. The aromatic head groups of shikonin form a π–π interaction with H41 on the S2 subsite. The hydroxy and methyl groups of the isohexenyl side chain of the shikonin tail form hydrogen bonds with R188 and Q189, respectively, in the S3 subsite. Such a unique mode of action expands our knowledge of M^pro inhibition.

Replication and transcription complex

Replication mechanism of the central RTC

In coronavirus infection, replication and transcription is regulated through a multisubunit mechanism¹⁰⁷, where the RdRP nsp12 catalyses viral RNA synthesis and thus acts as the key component of the RTC¹⁰⁸. In addition, the primase nsp8 (ref.¹⁰⁹) and an auxiliary factor, nsp7, contribute to the activation and continuous production of viral RNA¹¹⁰. nsp12 along with nsp7 and nsp8 makes up the complete RdRP complex.

SARS-CoV-2 nsp12 is composed of three major domains, a nidovirus RdRP-associated nucleotidyltransferase (NiRAN) domain, an interface domain and a right-handed RdRP domain (finger, palm and thumb)¹¹¹ (Fig. 6a). The active site of SARS-CoV-2 RdRP is located in the palm subdomain, which has a shape like other RNA polymerases, such as those from hepatitis C virus ns5b¹¹² and poliovirus 3Dpol¹¹³. The architecture of the central cavity is shared by other conserved polymerases involving the primer-template entry, nucleoside triphosphate (NTP) entry and nascent strand exit paths. Residues D760 and D761 are involved in the coordination of two Mg²⁺ ions essential for polymerase activity. One Mg²⁺ ion coordinates motif C and binds at the 3′ end (‘i’ site) of the RNA primer, facilitating the condensation reaction in RNA chain synthesis, while the second Mg²⁺ positions the incoming NTP and stabilizes the charge environment. Separate from conserved motifs A–E at the active site, motif F and motif G inside the fingers subdomain are conducive to guiding the RNA template. During viral RNA synthesis, notable structural rearrangements occur in this complex to accommodate the RNA¹¹⁴. Along with the product chain synthesis, the protruding RNA template–product duplex exits through the active site without steric hindrance and extends to two positively charged ‘sliding poles’ formed by two nsp8 N-terminal helices¹¹⁵ (Fig. 6b). Consistent with SARS-CoV nsp8 adopting variable conformations^116,117, N-terminal extensions of nsp8-2 (the second copy of nsp8) have two different orientations at the early replicating stage. In one orientation, it is adjacent to the finger subdomain, whereas in the other orientation, it interacts with the RNA duplex, suggesting that nsp8 may have regulatory functions in replication initiation. The complex consisting of nsp12, nsp7, nsp8 and RNA duplex reflects the replicating state in RdRP activity; therefore, it is referred to as the central RTC (C-RTC).

**Fig. 6: Structures of SARS-CoV-2 replication and transcription complex and its inhibitors.**

RNA elongation, capping and backtracking

The RTC needs to guarantee processive RNA duplex elongation without template–product dissociation so that viral genome or subgenome synthesis can be rapidly completed inside the host cell¹¹⁸. For coronaviruses, which have the largest known positive-sense RNA genomes, both replication efficiency and replication fidelity are essential for maintaining genetic integrity. The former relies on the functional elongation RTC (E-RTC), whereas the latter depends on proofreading by nsp14. An E-RTC is composed of a C-RTC and two coupled copies of the nsp13 helicase: nsp13-1 and nsp13-2 (ref.¹¹⁹) (Fig. 6c). nsp13 is believed to be crucial in viral replication and the mRNA capping process, which includes unwinding of the RNA duplex into single strands, 5′ to 3′ polarity formation and RNA 5′-triphosphatase activity^120,121. The unique domains of coronavirus nsp13, such as the zinc-binding domain, the stalk and the 1B domain, are all important for helicase activity¹²². In the structure of E-RTC, two nsp13 zinc-binding domains form extensive interactions with two nsp8 N-terminal helices. In particular, the zinc-binding domain from nsp13-2 forms additional interactions with the nsp12 thumb subdomain, stabilizing the overall structure during elongation^119,123. Before entering the nsp12 active site, the template RNA strand undergoes disruption of RNA secondary structure and guidance between the nsp13-2 RecA domain and the 1B domain to ensure the 5′ to 3′ translocation direction¹²⁴. Structural characterization of E-RTC not only helps elucidate the RNA elongation mechanisms but also suggests different functional roles that nsp13 may play in this event. In nsp13-2, residues N361 in the domain 1A, S468, T532 and D534 in the domain 2A and R178 and H230 in the domain 1B collectively contribute to template RNA recognition and elongation, demonstrating that nsp13-2 is directly involved in positioning downstream template RNA. Interestingly, the interactions between the nsp13-1 1B domain and the nsp13-2 1B domain have been shown to play a pivotal role in E-RTC helicase activity, even though nsp13-1 is far from nsp13-2 (ref.¹²³) (Fig. 6d). Therefore, nsp13-1 is indispensable for RNA elongation in that it is cooperatively coupled with nsp13-2 in the functioning E-RTC.

The capping modification of mRNA, which rigorously follows subgenomic mRNA synthesis, is essential for viral translation and propagation, mRNA protection and escape from host immune response^125,126. Similarly to the RNA elongation process, multiple NSPs participate in RTC assembly during sequential stages of mRNA capping, which can be divided into four main steps: (1) removal of the γ-phosphate of 5′-pppA by nsp13 with RNA 5′-triphosphatase activity¹²⁰; (2) transfer of GMP to 5′-ppA by the nsp12 NiRAN domain with guanylyltransferase (GTase) activity, leading to the generation of a GpppA cap structure¹²⁷; (3) methylation of N7-guanine by nsp14, which has N7-methyltransferase activity¹²⁸; and (4) methylation of the ribose 2′-O nucleotide into the final ^7MeGpppA_2′OMe cap structure by nsp1, which has 2′-O-methyltransferase activity¹²⁹. Multiple NSPs are assembled into the RTC in order according to their functional roles, a process which is accompanied by structural conformational changes. On one hand, the nsp12 NiRAN domain is involved in the second step to catalyze the ppA to GpppA transfer through its newly identified GTase activity. On the other hand, an intermediate state which has been captured by cryo-EM, shows that nsp9 can inhibit the GTase activity by tight insertion into the NiRAN catalytic centre in order to terminate the reaction (Fig. 6e). nsp9 is an RNA-binding protein, which is characterized by a positively charged groove¹³⁰. This groove, together with a β-hairpin at the nsp12 N terminus, provides an exit path for postcatalytic GpppA-RNA. Several hydrophobic interactions and hydrogen bonds enhance nsp9 binding to nsp12, suggesting that nsp9 plays a substantial role in the viral life cycle. Because it has been shown that disruption of the nsp9–nsp10 cleavage site is not lethal¹³¹ and nsp10 is able to tightly bind to nsp14 or nsp16 (refs^132,133), nsp9 may serve as a core regulator in recruiting the nsp10–nsp14 or nsp10–nsp16 complex for the following capping RTC assembly with N7-methyltransferase activity and 2′-O-methyltransferase activity.

Another important aspect relating to the RTC is its proofreading mechanism. Most RNA viruses replicate with estimated error rates between 10⁻³ and 10⁻⁵, which results in approximately one mutation per genome per round of replication for a typical ∼10-kb genome¹³⁴, a much higher mutation rate than occurs in cellular DNA replication¹³⁵. The lower fidelity may largely be due to the lack of proofreading activity in these viruses. By contrast, SARS-CoV-2, which encodes nsp14 (an exonuclease with proofreading activity), can maintain high fidelity during replication of its large genome. Proofreading involves the backtracking of mismatched template–product RNA chains. The single-stranded 3′ segment of the product RNA generated by backtracking extrudes through the RdRP NTP entry tunnel. Then a mismatched nucleotide located at the 3′ end of product RNA enters the conserved NTP entry tunnel to initiate backtracking, and meanwhile, nsp13 stimulates RdRP backtracking. The structure of C-RTC in complex with the essential nsp13 helicase and RNA suggests that the helicase can facilitate the backtracking mechanism¹³⁶ (Fig. 6f).

RdRP inhibitor discovery

The RdRP is a prime drug target for SARS-CoV-2 (Fig. 1a). Inhibition of RdRP activity will prevent viral replication and can potentially achieve clinical efficacy. Major efforts have been devoted to identify both nucleotide and non-nucleotide inhibitors, which have also been used as probes to understand the replication cycle of SARS-CoV-2 and to provide a basis for development of broad-spectrum antiviral drugs.

The prodrug remdesivir, which was initially developed for the treatment of Ebola virus infection, shows good activity against SARS-CoV-2 in in vitro assays¹³⁷ but limited efficacy in clinical trials. In the cell, remdesivir is phosphorylated to remdesivir triphosphate, enabling it to act as an ATP analogue. The structure of pretranslocated catalytic C-RTC clearly demonstrates the incorporation mode of remdesivir and suggests its inhibition mechanism¹¹⁴ (Fig. 6g). Kinetic analysis shows remdesivir triphosphate is preferred as a substrate over ATP¹³⁸ and terminates product chain elongation at a delayed position (i + 3). Once the inserted remdesivir monophosphate is transferred to the i + 3 position, the distance between the serine hydroxy oxygen from S861 and the 1′-cyano nitrogen from remdesivir monophosphate will be close to 2 Å, causing ‘delayed chain termination’. Further investigations indicate that an remdesivir-induced translocation barrier and RdRP stalling occur after the addition of three nucleotides upon incorporation of remdesivir into the product chain¹³⁹. Favipiravir is another nucleoside analogue that has been approved as an anti-influenza virus drug in Japan. Favipiravir simulates the incorporation of ATP and GTP into the product RNA, yet it inhibits viral proliferation by increasing the mutation rate of the viral genome rather than causing product chain terminations¹⁴⁰. The structure of the RdRP–favipiravir complex delineates a precatalytic state and identifies the conserved residues for favipiravir recognition (Fig. 6h).

Although nucleotide inhibitors can be inserted into RNA chains, they can later be cleaved by proofreading activity. Thus, non-nucleotide inhibitors have been considered as an alternative approach for drug development. Suramin, a century-old drug used to treat African sleeping sickness and river blindness, can effectively inhibit SARS-CoV-2 polymerase activity with at least 20-fold more activity than RDV-3Pi in biochemical assays and inhibits viral replication in vitro¹⁴¹. In the cryo-EM structure, two suramin molecules bind to the active sites of nsp12, with one occupying the template-binding site and the other occupying the primer catalytic active centre, implying that suramin may competitively inhibit protein–RNA binding due to its strong electronegativity (Fig. 6i). However, the highly negatively charged suramin has the potential to bind to many positively charged macromolecular surfaces, and thus its specific antiviral activity remains to be further investigated.

Accessory protein–host interactions

ORF3a, ORF9b, ORF7a and ORF8

ORF3a protein, encoded by ORF3a, is an ion channel membrane protein with 274 amino acids. It forms a potassium-sensitive channel and may promote virus release. The cryo-EM structure of SARS-CoV-2 ORF3a is the first viroporin family structure determined in coronaviruses¹⁴². The overall structure shows that ORF3a forms a dimer with the ion channel decorated with charged residues for cation conduction (Fig. 7a). It is noteworthy that ORF3a has a TRAF-binding domain at the N terminus that can activate NF-κB and the NLRP3 inflammasome¹⁴³, suggesting an important role in the host immune response. As ion channels are important therapeutic targets and many ion-channel drugs have already been approved for clinical trials, ORF3a is another good antiviral drug target¹⁴⁴.

**Fig. 7: Structures of SARS-CoV-2 accessory proteins.**

ORF9b is encoded by an alternative ORF within the N protein gene. ORF9b suppresses the type I interferon immune response by interacting with the mitochondrial import receptor subunit TOM70. Targeting the interactions between ORF9b and TOM70 has been proposed as a therapeutic option for SARS-CoV-2. The structure of SARS-CoV-2 ORF9b shows that it is dimeric, with each protomer composed mainly of β-strands¹⁴⁵ (Fig. 7b). The centre of the dimer has a hydrophobic environment for accommodating lipid molecules and membrane attachment.

ORF7a is a type I transmembrane protein and is also involved in virus–host interactions and protein trafficking within the ER and Golgi body. Its structure shows that it has a seven-stranded β-sandwich fold consistent with the immunoglobulin superfamily¹⁴⁶ (Fig. 7c). A deep hydrophobic pocket has been identified for potential inhibitor binding.

ORF8 is an accessory protein that is composed of 121 amino acids. It has an N-terminal signal sequence and adopts an immunoglobulin-like fold¹⁴⁷ (Fig. 7d). The structure of ORF8 shows that it can form a dimer, and each promoter of ORF8 contains eight antiparallel β-strands tied by three disulfide bonds. The covalently bonded dimer structure is stabilized by surface hydrophobic interactions and a series of hydrogen bonds. ORF8 is capable of assembling itself into large-scale homologous complexes; however, the oligomerization mechanism needs to be investigated further.

Conclusions

Coronaviruses have the largest genomes among all RNA viruses, encoding structural proteins and NSPs that achieve sustainability in a wide variety of ecological niches and hosts. Evolving viral proteins help coronaviruses to achieve host recognition and entry, genome replication, assembly and release of progeny viruses, and host immune surveillance evasion. In response to the COVID-19 pandemic, great efforts have been devoted to structural studies of SARS-CoV-2 proteins and viral–cellular protein complexes using X-ray crystallography and cryo-EM. Among them, the S protein, M^pro, PL^pro and RdRP are the most widely studied drug targets. A multidisciplinary combination of structural virology, ’omics technologies, immunology and virology will produce a more effective approach to structure-aided design of vaccines and therapeutics that have the potential for clinical use.